BPF Maps & Ring Buffer
Mental Model
A factory floor with different types of storage. Hash maps are filing cabinets -- look up any document by name in constant time, but someone must explicitly remove old files. Arrays are numbered shelves -- slot 0 through slot N always exist, instant access by number. Per-CPU maps give every worker their own private clipboard -- no waiting in line to write. LRU maps are filing cabinets with a janitor who throws out the least-touched documents when space runs low. The ring buffer is a conveyor belt from the factory floor to the loading dock -- items placed on the belt arrive in order, and the belt keeps moving whether or not someone is picking up items at the other end.
The Problem
An eBPF-based security monitoring tool running on production hosts starts losing events when system load exceeds 50,000 syscalls per second. The tool uses perf_event_output to send events from kernel space to a userspace daemon. At low load, everything works. Under sustained pressure, the per-CPU perf buffers overflow: each CPU has its own buffer, the userspace reader cannot poll all of them fast enough, and events are silently discarded. Switching to the BPF ring buffer -- a single buffer shared by all CPUs with a multi-producer reserve/commit protocol -- eliminates the per-CPU fragmentation, reduces memory usage by 60%, and handles 200,000 events per second with zero drops.
Architecture
An eBPF program runs inside the kernel, observing every packet or syscall. It discovers something important -- a connection, an anomaly, a counter that needs incrementing. Now what? The program cannot call printf. It cannot write to a file. It cannot allocate heap memory. It lives in a sandbox where the verifier has stripped away almost every freedom a normal program enjoys.
BPF maps are the answer. They are the data structures that bridge the gap between eBPF programs running in kernel context and userspace applications that need the data. Every serious eBPF deployment -- Cilium, Falco, bpftrace, Katran -- depends entirely on choosing the right map type for the job.
The Map Abstraction
A BPF map is a kernel-resident data structure created via the bpf() syscall with the BPF_MAP_CREATE command. It has a type, a fixed key size, a fixed value size, and a maximum number of entries. Once created, both BPF programs (in kernel space) and userspace applications can read and write to it through well-defined interfaces.
From the BPF program side, map access uses helper functions:
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key);
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, __u64 flags);
long bpf_map_delete_elem(struct bpf_map *map, const void *key);
From userspace, the same operations go through the bpf() syscall:
bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
The verifier checks every map access at program load time. It ensures the key and value sizes match, that lookup results are null-checked before dereferencing, and that map operations happen only on maps the program is authorized to access.
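A minimal kernel-side sketch of what this enforcement looks like in practice, assuming a flow map and value struct like the conntrack example in the next section (the packets field is illustrative):

struct conn_value *val;

val = bpf_map_lookup_elem(&conntrack, &key);
if (!val)
    return 0;   /* no entry for this key; without this check the verifier rejects the program */

/* Pointer is proven non-NULL. The map is shared across CPUs, so use an atomic add. */
__sync_fetch_and_add(&val->packets, 1);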
Hash Maps
BPF_MAP_TYPE_HASH is the general-purpose key-value store. Keys can be any fixed-size blob -- a 4-byte IP address, a 13-byte connection 5-tuple, a 32-byte struct. The kernel implements it as a hash table with per-bucket spin locks.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 262144);
    __type(key, struct conn_key);     /* 13 bytes: IPs + ports + proto */
    __type(value, struct conn_value); /* 32 bytes: counters + state */
} conntrack SEC(".maps");
Lookup is O(1) average case. Insert allocates memory on the first write to a key and fails with -E2BIG when max_entries is reached. Delete frees the entry. The per-bucket spin locks mean that concurrent access to different buckets proceeds in parallel, but two CPUs hitting the same bucket serialize.
For Cilium's connection tracking, this is the workhorse. Every TCP and UDP flow gets an entry keyed by 5-tuple. NAT decisions, policy verdicts, and byte counters live in the value. With 256K max_entries, the map handles a busy Kubernetes node comfortably.
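A hedged sketch of the insert-or-update path from the BPF side. The flag semantics and error behavior are real; the struct fields and the pkt_len variable are illustrative:

struct conn_value *val = bpf_map_lookup_elem(&conntrack, &key);
if (!val) {
    struct conn_value init = {};   /* zeroed counters and state */
    /* BPF_NOEXIST: fail instead of overwriting if another CPU won the race */
    if (bpf_map_update_elem(&conntrack, &key, &init, BPF_NOEXIST) < 0)
        return 0;                  /* map full (-E2BIG) or lost the insert race */
    val = bpf_map_lookup_elem(&conntrack, &key);
    if (!val)
        return 0;
}
__sync_fetch_and_add(&val->bytes, pkt_len);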
Array Maps
BPF_MAP_TYPE_ARRAY is a fixed-size array where keys are integers from 0 to max_entries - 1. All entries are pre-allocated at map creation time, so lookups never fail for valid indices (the pointer returned is always non-NULL).
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, struct config_entry);
} config SEC(".maps");
Arrays are ideal for configuration data pushed from userspace (index 0 = sampling rate, index 1 = log level) and for lookup tables where the key is a small integer. They cannot have entries deleted -- once allocated, all slots exist for the map's lifetime.
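A sketch of the typical read path in a BPF program, assuming slot 0 holds a sampling rate (the config_entry field name is illustrative):

__u32 idx = 0;   /* slot 0 = sampling rate, by convention */
struct config_entry *cfg = bpf_map_lookup_elem(&config, &idx);
if (!cfg)
    return 0;    /* cannot happen for idx < max_entries, but the verifier still requires the check */
if (bpf_get_prandom_u32() % 100 >= cfg->sample_rate)
    return 0;    /* sampled out: skip this event */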
Per-CPU Maps
The per-CPU variants -- BPF_MAP_TYPE_PERCPU_HASH and BPF_MAP_TYPE_PERCPU_ARRAY -- give each CPU core its own private copy of every value. This is not an optimization. For high-frequency counters, it is a requirement.
Consider a packet counter updated on every received packet. On a 64-core machine processing 10 million packets per second, a shared counter means 10 million atomic increments per second, all bouncing the same cache line across 64 cores. The cache coherency protocol (MESI/MOESI) turns this into a serialization point where most of the time is spent waiting for cache line ownership.
Per-CPU maps eliminate all of this. Each core increments its local copy with a simple store instruction. No atomics. No cache-line bouncing. No spin locks. The cost is that userspace reads back NR_CPUS copies and must sum them:
int ncpus = libbpf_num_possible_cpus();
struct vip_stats values[ncpus];            /* one slot per possible CPU */

bpf_map_lookup_elem(map_fd, &key, values); /* kernel fills every CPU's copy */

__u64 total = 0;
for (int i = 0; i < ncpus; i++)
    total += values[i].packets;
Katran uses per-CPU arrays for exactly this pattern. Each VIP index holds packet and byte counts that each core updates independently. Userspace polls every second and aggregates.
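The kernel-side half of that pattern might look like the sketch below. The map name and sizes mirror the bpftool output shown later in this section; the vip_stats fields and the vip_idx/pkt_len variables are illustrative:

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 4096);
    __type(key, __u32);
    __type(value, struct vip_stats);
} vip_counters SEC(".maps");

/* In the packet path: each core sees only its own copy of the value. */
struct vip_stats *st = bpf_map_lookup_elem(&vip_counters, &vip_idx);
if (st) {
    st->packets += 1;   /* plain store -- no atomics, no cache-line bouncing */
    st->bytes += pkt_len;
}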
LRU Hash Maps
BPF_MAP_TYPE_LRU_HASH solves the stale entry problem. A regular hash map that hits max_entries rejects all further inserts. For a connection tracking table, this means new connections are dropped once the table is full, even if most entries are for connections that closed hours ago.
LRU maps automatically evict the least recently accessed entry when space runs out. The implementation uses per-CPU LRU lists to reduce contention, with a global list as a fallback. Eviction is approximate -- under extreme churn, a recently accessed entry might get evicted if it lands on a CPU's local list that happens to be under pressure.
struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 100000);
    __type(key, struct rate_key);
    __type(value, struct rate_value);
} rate_limiter SEC(".maps");
Size LRU maps at 2-3x the expected steady-state working set. At exactly the working set size, every burst causes eviction storms that remove entries still in active use.
The Ring Buffer
Before Linux 5.8, streaming events from BPF programs to userspace meant BPF_MAP_TYPE_PERF_EVENT_ARRAY. This creates one ring buffer per CPU. Each BPF program calls bpf_perf_event_output() to push data into the calling CPU's buffer. Userspace must open and poll all N buffers independently.
The problems with this design surface under production load:
- Memory waste. Each per-CPU buffer must be sized for that CPU's peak event rate. On a 128-core machine, sizing each buffer at 64 KB uses 8 MB total. But if events are bursty and concentrated on a few CPUs, most of that 8 MB sits unused while the hot CPUs overflow.
- Event loss under asymmetric load. CPU 47 handles an interrupt storm, fills its buffer, and drops events. CPUs 48 through 127 sit idle with empty buffers. The total system has plenty of buffer capacity, but it is stranded on the wrong CPUs.
- Userspace complexity. The consumer must epoll_wait() on N file descriptors and handle events from each CPU independently, often with ordering challenges.
BPF_MAP_TYPE_RINGBUF (Linux 5.8+) replaces all of this with a single shared ring buffer:
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  /* 256 KB shared across all CPUs */
} events SEC(".maps");
The ring buffer uses a reserve-commit protocol. A BPF program reserves space in the buffer with bpf_ringbuf_reserve(), writes data into the reserved region, and commits it with bpf_ringbuf_submit(). Multiple CPUs can reserve and commit concurrently; the kernel serializes only the brief update of the producer position, while filling the reserved region and committing proceed in parallel.
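A minimal producer sketch (the event struct and its fields are illustrative; the helpers and the mandatory NULL check are real):

struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e)
    return 0;                /* buffer full: the event is dropped */
e->pid = bpf_get_current_pid_tgid() >> 32;
bpf_get_current_comm(e->comm, sizeof(e->comm));
bpf_ringbuf_submit(e, 0);    /* or bpf_ringbuf_discard(e, 0) to abort */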
Userspace consumes events through a single file descriptor:
struct ring_buffer *rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
while (ring_buffer__poll(rb, 100 /* ms timeout */) >= 0)
    ;   /* handle_event() is invoked once per committed record */
Events are consumed in the order they were reserved (not grouped by CPU), and the single buffer means total capacity is shared across all CPUs. A 256 KB ring buffer handles sustained throughput that would require 8 MB of per-CPU perf buffers on the same machine.
Map Pinning and Lifetime
By default, a BPF map exists as long as at least one BPF program or file descriptor references it. When the last reference is dropped, the map and all its data are destroyed.
For maps that hold persistent state -- connection tracking, counters, configuration -- this is unacceptable. Restarting the Cilium agent would wipe all connection tracking entries, causing thousands of active connections to reset.
Map pinning solves this. Pinning a map to bpffs creates a filesystem entry at /sys/fs/bpf/<name> that holds a reference to the map:
# Pin a map
bpftool map pin id 42 /sys/fs/bpf/conntrack_map
# Later, retrieve the pinned map
bpftool map show pinned /sys/fs/bpf/conntrack_map
From code:
// Pin
bpf_obj_pin(map_fd, "/sys/fs/bpf/conntrack_map");
// Retrieve
int fd = bpf_obj_get("/sys/fs/bpf/conntrack_map");
The pinned map survives program unload and reload. A new version of the BPF program can attach to the existing map and pick up right where the old version left off. Cilium relies on this for zero-downtime datapath upgrades -- the agent restarts, reloads programs, and reconnects to pinned maps without losing a single connection tracking entry.
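With libbpf, pinning can also be declared directly in the map definition. A sketch assuming libbpf's LIBBPF_PIN_BY_NAME support, which pins under /sys/fs/bpf on first load and transparently reuses an existing compatible pinned map on reload:

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 262144);
    __type(key, struct conn_key);
    __type(value, struct conn_value);
    __uint(pinning, LIBBPF_PIN_BY_NAME);  /* reuses /sys/fs/bpf/conntrack if present */
} conntrack SEC(".maps");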
Inspecting Maps at Runtime
bpftool is the primary tool for inspecting BPF maps on a running system:
# List all maps
$ bpftool map list
42: hash name conntrack flags 0x0
key 16B value 32B max_entries 262144 memlock 12582912B
pinned /sys/fs/bpf/conntrack_map
57: percpu_array name vip_counters flags 0x0
key 4B value 16B max_entries 4096 memlock 33554432B
63: ringbuf name events flags 0x0
max_entries 262144 memlock 266240B
# Dump a hash map
$ bpftool map dump id 42
key: 0a 00 01 05 0a 00 02 0a 1f 90 c3 50 06 00 00 00
value: 00 00 00 00 00 00 03 e8 00 00 00 00 00 18 6a 00 ...
# Look up an entry in a per-CPU map (bpftool prints each CPU's copy)
$ bpftool map lookup id 57 key 0 0 0 0
key: 00 00 00 00
value (CPU 00): 00 00 00 00 00 00 41 a3 00 00 00 00 00 3c f2 80
value (CPU 01): 00 00 00 00 00 00 3e 22 00 00 00 00 00 38 b1 40
value (CPU 02): 00 00 00 00 00 00 42 f1 00 00 00 00 00 3f 01 c0
...
For debugging event loss, check the ring buffer's consumer position against the producer position. If the consumer is falling behind, events will be dropped when the buffer wraps around. The fix is either a larger buffer or a faster consumer (typically by moving slow I/O out of the consumption callback).
Common Questions
When should a hash map be used instead of an array map?
Use hash maps when entries are created and destroyed dynamically (connection tracking, process monitoring) or when keys are not small integers (IP addresses, 5-tuples, strings). Use array maps when the key space is a contiguous range of integers and all entries should exist for the map's entire lifetime (configuration tables, histograms with fixed bucket counts, program state).
How much memory does a per-CPU map actually use?
NR_CPUS * max_entries * value_size for arrays, plus overhead. On a 128-core machine, a per-CPU array with 10,000 entries of 16 bytes each uses 128 * 10,000 * 16 = ~20 MB. This is often acceptable given the contention it eliminates, but it can surprise operators who see memlock values much larger than expected in bpftool map list.
Can BPF maps be shared between multiple BPF programs?
Yes. Multiple BPF programs can reference the same map if they are loaded with the same map file descriptor or if they attach to a pinned map. This is how Cilium coordinates between its XDP, tc, and cgroup BPF programs -- they all share the same connection tracking and policy maps.
What happens when a ring buffer is full and a BPF program tries to write?
bpf_ringbuf_reserve() returns NULL. The BPF program must check for this and handle it -- typically by incrementing a drop counter in a separate map and returning. Events are lost. The fix is a larger ring buffer or a faster consumer, not a retry loop (BPF programs cannot block).
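A sketch of that drop-counter pattern (map and struct names are illustrative):

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} ringbuf_drops SEC(".maps");

struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) {
    __u32 zero = 0;
    __u64 *drops = bpf_map_lookup_elem(&ringbuf_drops, &zero);
    if (drops)
        (*drops)++;          /* per-CPU copy: plain increment is safe */
    return 0;                /* no retry: BPF programs cannot block */
}
/* ... fill the event, then bpf_ringbuf_submit(e, 0) ... */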
How Technologies Use This
A Docker Swarm cluster running 3,000 containers across 50 nodes enforces network policies that specify which containers can communicate with each other. Using iptables, each node maintains 12,000 rules that the kernel evaluates linearly for every incoming and outgoing packet. At 200,000 packets per second per node, iptables rule evaluation consumes 35% of CPU time, and adding 500 more containers requires 2,000 additional rules that further degrade packet processing throughput.
Cilium, deployed as the container networking plugin, replaces iptables-based policy enforcement with BPF hash maps pinned to the bpffs virtual filesystem. Each network policy rule becomes a key-value entry in a BPF hash map, where the key is a combination of source and destination security identity (derived from container labels) and the value stores the allow/deny verdict. When a packet arrives at a container's veth interface, a BPF program attached to the TC (traffic control) hook performs a single O(1) hash map lookup to determine whether the flow is permitted. The lookup cost is constant regardless of whether the cluster has 100 or 10,000 policy rules.
On a node handling 500,000 concurrent connections across 60 containers, the Cilium BPF hash maps consume approximately 80 MB of pinned kernel memory. Each connection's NAT state, policy verdict, and byte counters are stored in a per-CPU hash map variant, which eliminates lock contention because each CPU core writes exclusively to its own copy of the map. The CPU overhead for policy enforcement drops from 35% under iptables to under 5% with BPF maps, freeing 30% of node CPU capacity for application workloads.
A Kubernetes cluster with 4,000 ClusterIP services runs kube-proxy in iptables mode on each node. Every service creates approximately 8 iptables rules (pre-routing, output, service chain, endpoint chains), resulting in 32,000 rules per node. When a pod sends a packet to a ClusterIP, the kernel walks these rules sequentially in the nat table to find the matching service and select a backend endpoint. At scale, this linear search adds 1.5 ms of latency per connection setup and consumes 20% of node CPU.
Replacing kube-proxy with Cilium's BPF-based service implementation stores the entire service-to-endpoint mapping in a BPF hash map. The map key is the service ClusterIP and port; the value contains the list of backend pod IPs and the load-balancing state. A BPF program attached at the socket layer (sock_ops and connect hooks) intercepts outgoing connections and performs a single hash map lookup to resolve the service to a backend. For established connections, subsequent packets bypass the lookup entirely because the BPF program rewrites the destination at connect time rather than on every packet.
On a node with 4,000 services and 12,000 backend endpoints, the BPF service map occupies roughly 15 MB of kernel memory. Connection setup latency drops from 1.5 ms (iptables linear walk) to under 10 microseconds (single hash lookup). The CPU savings are proportional to service count: at 4,000 services, the BPF approach uses 95% less CPU for service resolution than iptables mode, and scaling to 10,000 services adds negligible overhead because hash map lookup time remains constant.
An Nginx reverse proxy handles 2 million HTTP requests per second across 800 upstream servers. During traffic spikes, certain clients send 50,000 requests per second, overwhelming individual upstream servers. Traditional rate limiting using Nginx's limit_req module operates at Layer 7 after the kernel has already performed TCP handshake processing, socket allocation, and HTTP parsing for every request, wasting CPU cycles on traffic that will ultimately be rejected.
XDP (eXpress Data Path) programs attached to the NIC driver hook process packets before they enter the kernel networking stack. A BPF program at the XDP layer maintains a per-CPU hash map keyed by source IP address, where each value stores a token bucket counter with a last-updated timestamp. When a packet arrives, the XDP program performs a hash map lookup, decrements the token count, and either passes the packet up the stack (XDP_PASS) or drops it immediately (XDP_DROP). Dropped packets never allocate an sk_buff, never enter the TCP state machine, and never consume Nginx worker CPU time. The per-CPU map variant ensures that each core updates its own copy of the rate counter without atomic operations or cache-line contention.
At 2 million requests per second on a 32-core server, the XDP rate limiter processes each packet in approximately 50 nanoseconds, compared to 5 microseconds for the same decision made at Layer 7 inside Nginx. During an attack generating 10 million packets per second from 5,000 source IPs, the XDP program drops 8 million packets per second at line rate while allowing legitimate traffic through. The per-CPU hash map holding 5,000 entries across 32 cores uses 32 * 5,000 * 64 bytes, approximately 10 MB of memory, a negligible cost relative to the CPU savings from not processing millions of packets through the full kernel networking stack.
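A condensed, hypothetical version of such an XDP token bucket. Constants, map sizing, and struct layout are illustrative, and with a per-CPU map each core enforces its own share of the rate rather than a global limit:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define RATE_PER_SEC 1000ULL   /* tokens refilled per second, per CPU */
#define BURST        2000ULL   /* bucket capacity */

struct bucket {
    __u64 tokens;
    __u64 last_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
    __uint(max_entries, 16384);
    __type(key, __u32);        /* source IPv4 address */
    __type(value, struct bucket);
} rl SEC(".maps");

SEC("xdp")
int rate_limit(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    struct iphdr *ip;

    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    __u32 saddr = ip->saddr;
    __u64 now = bpf_ktime_get_ns();

    struct bucket *b = bpf_map_lookup_elem(&rl, &saddr);
    if (!b) {
        struct bucket init = { .tokens = BURST, .last_ns = now };
        bpf_map_update_elem(&rl, &saddr, &init, BPF_ANY);
        return XDP_PASS;
    }

    /* Refill proportionally to elapsed time, capped at the burst size. */
    __u64 refill = (now - b->last_ns) * RATE_PER_SEC / 1000000000ULL;
    b->tokens = b->tokens + refill > BURST ? BURST : b->tokens + refill;
    b->last_ns = now;

    if (!b->tokens)
        return XDP_DROP;       /* no sk_buff allocated, no stack traversal */
    b->tokens--;
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";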
Same Concept Across Tech
| Technology | How it uses BPF maps | Key gotcha |
|---|---|---|
| Cilium | Pinned hash maps for conntrack, per-CPU arrays for packet counters, LRU maps for NAT entries | Unpinned maps lose all connection state on agent restart. Always pin to /sys/fs/bpf/ |
| Falco | Ring buffer (previously perf_event_array) for streaming security events to userspace | perf_event_array drops events under asymmetric CPU load. Ring buffer eliminates this |
| bpftrace | Per-CPU hash maps for @aggregations, arrays for histograms, printf via perf_event_output | Large aggregation maps can hit the map-key limit. Raise it with the BPFTRACE_MAP_KEYS_MAX environment variable |
| BCC tools | Hash maps for per-process/per-file statistics, arrays for configuration, perf buffers for events | Python bindings add overhead to map reads. For high-frequency polling, use libbpf directly |
| Katran | Per-CPU arrays for VIP packet/byte counters, hash maps for consistent hashing state | Per-CPU arrays on 128-core machines use 128x memory. Size accordingly |
Stack layer mapping (BPF map debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Which map type is being used, and is it the right choice for the access pattern? | bpftool map list, source code review |
| BPF program | Are map operations succeeding, or returning -ENOENT / -E2BIG? | bpf_trace_printk() on error paths |
| Map subsystem | Is the map full? Is LRU evicting too aggressively? | bpftool map show id (check max_entries vs current entries) |
| Ring buffer | Is the consumer keeping up, or is the buffer filling? | bpftool map show, check consumer lag |
| Kernel | Is bpffs mounted? Are maps pinned correctly? | mount (look for a bpf entry), ls /sys/fs/bpf/ |
| Hardware | Is per-CPU memory pressure causing allocation failures? | dmesg for BPF allocation errors |
Design Rationale
eBPF programs run in kernel context where they cannot call arbitrary kernel functions or allocate memory freely. Maps provide the controlled, verifier-checked interface for data storage and communication. The variety of map types exists because no single data structure serves all access patterns. Hash maps handle dynamic key-value lookups. Arrays handle fixed-index access with zero allocation overhead. Per-CPU variants eliminate synchronization for high-frequency writes. The ring buffer solves the event streaming problem that perf_event_array handled poorly -- one shared buffer instead of N independent per-CPU buffers, with a reserve/commit protocol that lets multiple CPUs write concurrently with minimal coordination.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Events dropped under high load | perf_event_output per-CPU buffer overflow on hot CPUs | Switch to BPF_MAP_TYPE_RINGBUF, or increase per-CPU buffer size |
| CPU usage spikes on map-heavy BPF program | Shared (non-per-CPU) map with high write contention | Switch to per-CPU map variant, check bpftool map show for type |
| Map insert returns -E2BIG | Map reached max_entries limit | Increase max_entries or switch to LRU map for auto-eviction |
| Connection state lost after program restart | Maps not pinned to bpffs | Pin maps with bpf(BPF_OBJ_PIN) to /sys/fs/bpf/, verify with ls |
| LRU map evicts entries that are still active | max_entries too close to working set size, eviction too aggressive | Size LRU map at 2-3x expected working set |
| Userspace reads stale counter values | Reading per-CPU map but not summing all CPU copies | Use bpftool map lookup (one value per CPU is printed), sum all CPU values in application |
| Ring buffer consumer falls behind | Synchronous I/O in consumer callback blocks event processing | Consume into in-memory queue, drain asynchronously |
When to Use / Avoid
Relevant when:
- Building or debugging eBPF programs that communicate state between kernel and userspace
- Diagnosing event loss in eBPF-based monitoring tools (Falco, Cilium Hubble, custom probes)
- Choosing between perf_event_output and BPF ring buffer for event streaming
- Optimizing high-frequency counters in XDP or tc programs (per-CPU vs shared maps)
- Understanding why a BPF map runs out of space or why entries disappear (LRU eviction)
Watch out for:
- perf_event_output drops events under asymmetric CPU load because per-CPU buffers overflow independently
- Shared (non-per-CPU) maps under high write rates cause spin lock contention that dominates CPU usage
- Unpinned maps are destroyed on program unload, losing all accumulated state
- LRU eviction is approximate and can evict hot entries under sustained pressure near max_entries
Try It Yourself
# List all BPF maps currently loaded in the kernel
bpftool map list

# Show detailed info for a specific map
bpftool map show id 42

# Dump all entries in a hash map
bpftool map dump id 42

# Look up a specific key in a map (key in hex bytes)
bpftool map lookup id 42 key 0x0a 0x00 0x01 0x01

# Look up a per-CPU map entry (bpftool prints the value on each CPU)
bpftool map lookup id 42 key 0x00 0x00 0x00 0x01

# Pin a map to bpffs so it survives program restart
bpftool map pin id 42 /sys/fs/bpf/my_conntrack_map

# List pinned objects on bpffs
ls -la /sys/fs/bpf/

# Create a hash map from the command line (for testing)
bpftool map create /sys/fs/bpf/test_map type hash key 4 value 8 entries 1024 name test_map

# Delete a specific entry from a map
bpftool map delete id 42 key 0x0a 0x00 0x01 0x01

# Show all BPF programs and their map references
bpftool prog show

# Check if bpffs is mounted
mount | grep bpf

# Monitor BPF-related kernel messages
dmesg | grep -i bpf

Debug Checklist
1. List all BPF maps and check types and sizes: bpftool map list
2. Dump map contents to verify entries: bpftool map dump id <map_id>
3. Check for pinned maps: ls -la /sys/fs/bpf/
4. Monitor ring buffer usage: bpftool map show id <ring_buf_id>
5. Check for dropped perf-buffer events: register a lost-sample callback (libbpf's perf_buffer lost_cb) in the consumer
6. Verify per-CPU map values: bpftool map lookup id <map_id> key <hex_key> (one value per CPU is printed)
Key Takeaways
- ✓BPF ring buffer (BPF_MAP_TYPE_RINGBUF) is strictly superior to perf_event_array for event streaming. It uses a single shared buffer instead of per-CPU buffers, which means better memory efficiency (one buffer sized to aggregate throughput, not N buffers each sized for peak per-CPU throughput) and simpler userspace consumption (one fd to poll instead of N).
- ✓Per-CPU maps are not optional for high-frequency counters. A shared hash map with 10 million updates per second across 64 cores spends more time on spin lock contention than on actual work. Per-CPU variants eliminate all synchronization from the write path. The cost is NR_CPUS copies of each value in memory and a userspace aggregation step on read.
- ✓LRU hash maps solve the stale entry problem that plagues long-running BPF programs. A connection tracking map without eviction grows until it hits max_entries and then fails all inserts. LRU maps evict cold entries automatically, but the eviction is approximate -- under heavy churn, hot entries can be evicted if the LRU lists are not perfectly maintained. Size the map at 2-3x expected steady-state entries.
- ✓Map pinning to bpffs (/sys/fs/bpf/) decouples map lifetime from program lifetime. A pinned map survives program restart, allowing a new version of a BPF program to attach to existing state without losing connection tracking entries or counters. Cilium relies on this for seamless datapath upgrades.
- ✓The bpf() syscall is the single entry point for all map operations from userspace: BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM, BPF_MAP_DELETE_ELEM, BPF_MAP_GET_NEXT_KEY. From BPF program context, maps are accessed via helper functions like bpf_map_lookup_elem() that the verifier validates at load time.
Common Pitfalls
- ✗Using perf_event_output when BPF ring buffer is available. perf event arrays allocate one buffer per CPU, each sized for worst-case throughput. On a 128-core machine with 64 KB per-CPU buffers, that is 8 MB of ring buffer memory fragmented across 128 independent buffers. The BPF ring buffer achieves the same throughput with a single 256 KB buffer and avoids the stranded-capacity drops that occur under asymmetric load, where some CPUs overflow while others sit idle.
- ✗Forgetting to use per-CPU maps for frequently updated counters. A regular BPF_MAP_TYPE_HASH protects each bucket with a spin lock. At 1 million updates per second on a 64-core machine, lock contention dominates. The fix is BPF_MAP_TYPE_PERCPU_HASH, which eliminates all locking. The tradeoff: reads require summing NR_CPUS values in userspace.
- ✗Setting max_entries too low on LRU hash maps. When the map is full and churn is high, the LRU eviction runs on the hot path of every insert. If max_entries matches the expected steady state exactly, brief traffic spikes cause eviction storms that remove entries still in active use. Size LRU maps at 2-3x the expected working set.
- ✗Not pinning maps that should survive program restarts. Without pinning, a BPF map is destroyed when the last program referencing it is unloaded. Restarting a Cilium agent without pinned maps drops all connection tracking state, causing thousands of connections to reset. Always pin maps that hold persistent state to /sys/fs/bpf/.
- ✗Blocking in the userspace ring buffer consumer. The BPF ring buffer delivers events in order with a callback or epoll interface. If the consumer blocks on slow I/O (writing events to disk synchronously, making network calls), the ring buffer fills and events are lost. Consume into an in-memory queue first, then drain the queue asynchronously.
Reference
In One Line
BPF maps are the shared memory between eBPF programs and userspace -- pick the wrong type and a monitoring tool either drops events, burns CPU on lock contention, or silently evicts the entries it needs most.