Sysctl Tuning Reference
Mental Model
A factory ships with every dial set to "safe for a small workshop." The dials control how many people can line up at the door (somaxconn), how wide the conveyor belts are (tcp buffer sizes), how aggressively the janitor sweeps (dirty page writeback), and how many toolboxes can be open at once (file-max). Running a warehouse-scale operation on workshop settings means the dials become the bottleneck long before the building does. sysctl is the wrench for adjusting every dial.
The Problem
An Nginx server starts dropping connections under load. dmesg shows "TCP: request_sock_TCP: Possible SYN flooding on port 443" because net.core.somaxconn is still at the default 128. The listen backlog fills instantly during traffic spikes, and the kernel silently discards new SYN packets. Clients see connection timeouts. The application logs show nothing because the rejection happens before Nginx ever sees the connection. Increasing the Nginx backlog directive alone does not help -- the kernel clamps it to somaxconn.
Architecture
A fresh Linux installation ships with kernel parameters tuned for a modest general-purpose workload. These defaults are deliberately conservative: a small listen backlog, moderate socket buffers, eager swapping, and filesystem limits scaled to prevent a misconfigured box from exhausting resources on first boot.
Production servers are not general-purpose. A web proxy handling 100K concurrent connections, a database managing 48 GB of hot data, or a container host running 200 microservices all need different kernel parameters. The gap between default settings and production requirements is where sysctl lives.
The Problem with Defaults
Here is what a stock kernel looks like on parameters that matter:
| Parameter | Default | Production Reality |
|---|---|---|
| net.core.somaxconn | 128 | Nginx requests backlog of 511; kernel clamps it to 128 |
| net.ipv4.tcp_rmem (max) | ~6 MB | Insufficient for high-BDP links (1 Gbps * 50ms = 6.25 MB) |
| net.ipv4.tcp_wmem (max) | ~4 MB | Same BDP problem on the send side |
| net.core.netdev_budget | 300 | 10 Gbps NIC delivers 1M+ pps; 300 packets per poll cycle means constant re-scheduling |
| vm.swappiness | 60 | Kernel happily swaps database heap pages to make room for page cache |
| vm.dirty_ratio | 20 | 20% of RAM as dirty pages means a 64 GB host can accumulate 12.8 GB before stalling |
| vm.overcommit_memory | 0 | Heuristic mode: allows overcommit, then OOM-kills when memory runs out |
| fs.file-max | ~800K (varies) | 200 containers with sockets, logs, and IPC channels burn through this |
Every one of these defaults is reasonable for a laptop. Every one of them is wrong for a production server under real load.
Network Tuning
Listen Backlog: somaxconn and tcp_max_syn_backlog
When a client sends a SYN packet, the kernel places the half-open connection in the SYN queue. After the three-way handshake completes, the connection moves to the accept queue. net.core.somaxconn is the ceiling for the accept queue. net.ipv4.tcp_max_syn_backlog is the ceiling for the SYN queue.
The default somaxconn of 128 means every listen() call is clamped to a 128-entry backlog, regardless of what the application requests. During a traffic spike, the queue fills in milliseconds, and subsequent SYN packets are silently dropped. The kernel logs "TCP: request_sock_TCP: Possible SYN flooding" in dmesg, but the application never sees the dropped connections.
# Check current values
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
# Check if backlog is filling up (Recv-Q approaching Send-Q)
ss -lnt | column -t
# Raise both to handle production traffic
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
Socket Buffers: tcp_rmem and tcp_wmem
These parameters control the kernel's auto-tuning range for TCP receive and send buffers. Each takes three values: minimum, default, and maximum (in bytes).
The kernel dynamically adjusts each socket's buffer within this range based on available memory and connection behavior. The default maximum of ~6 MB limits throughput on high-latency links: a 1 Gbps link with 50ms RTT has a bandwidth-delay product of 6.25 MB, and the receiver cannot advertise a window large enough to fill the pipe.
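The bandwidth-delay product figure above can be checked directly with shell arithmetic (numbers taken from the text: 1 Gbps link, 50 ms RTT):

```shell
# Bandwidth-delay product = link rate (bytes/s) * RTT (s)
# 1 Gbps = 1,000,000,000 bits/s = 125,000,000 bytes/s; RTT = 50 ms
bdp=$(( 125000000 * 50 / 1000 ))
echo "BDP: $bdp bytes"   # 6250000 bytes, i.e. 6.25 MB -- right at the default tcp_rmem max
```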
# View current auto-tuning ranges
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# Set wider ranges for high-throughput servers
sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
The minimum (4096) prevents degenerate buffer sizes. The default (87380/65536) is the starting point for new connections. The maximum (33554432 = 32 MB) is the ceiling the auto-tuner can reach. The kernel will not actually allocate 32 MB per socket unless the connection needs it.
NAPI Polling: netdev_budget
Network Interface Cards deliver packets via hardware interrupts. The NAPI subsystem converts these to polled mode: after the first interrupt, the kernel polls the NIC for more packets without triggering additional interrupts. net.core.netdev_budget controls how many packets the kernel processes per poll cycle.
The default of 300 is too low for 10 Gbps NICs. At 1 million packets per second, processing only 300 per cycle means re-scheduling softirq over 3,000 times per second, adding latency and CPU overhead.
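A quick sanity check on that figure:

```shell
pps=1000000   # packets per second from a busy 10 Gbps NIC
budget=300    # default net.core.netdev_budget
cycles=$(( pps / budget ))
echo "poll cycles/sec: $cycles"   # 3333 -- the softirq is re-scheduled thousands of times per second
```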
# Raise the per-cycle packet budget and the time allowance for each cycle
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=8000
Memory Tuning
Swap Behavior: vm.swappiness
This parameter is widely misunderstood. It does not control whether swapping happens. It controls the relative priority the kernel gives to reclaiming anonymous pages (swapping) versus reclaiming page cache.
- swappiness=60 (default): the kernel is fairly willing to swap, though it still weights page-cache reclaim more heavily
- swappiness=10: strongly prefer dropping page cache; swap only under significant pressure
- swappiness=0: avoid swapping almost entirely; only swap to prevent OOM
Database hosts should use swappiness=10. Swapping a PostgreSQL shared buffer page or a Redis key to disk causes latency spikes of 10-100ms per page fault. Dropping page cache is almost always the better trade-off, since the database has its own caching layer.
sysctl -w vm.swappiness=10
Dirty Page Writeback: vm.dirty_ratio and vm.dirty_background_ratio
When a process writes to a file, the data lands in page cache and the page is marked dirty. Two thresholds control when dirty pages are flushed to disk:
- vm.dirty_background_ratio: when dirty pages exceed this percentage of total RAM, background flusher threads start writeback (non-blocking)
- vm.dirty_ratio: when dirty pages exceed this percentage, the writing process itself is forced into synchronous writeback (blocking)
With the defaults (dirty_background_ratio=10, dirty_ratio=20), a 64 GB host can accumulate 12.8 GB of dirty data before any process stalls. On a database host, this means a checkpoint can cause a sudden I/O storm as gigabytes of dirty pages are written at once.
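The thresholds translate into absolute sizes like this (64 GB host, default ratios, integer MB arithmetic):

```shell
ram_mb=$(( 64 * 1024 ))
bg_mb=$(( ram_mb * 10 / 100 ))     # background flushing starts: 6553 MB (~6.4 GB)
hard_mb=$(( ram_mb * 20 / 100 ))   # writers block: 13107 MB (~12.8 GB)
echo "background: ${bg_mb} MB, blocking: ${hard_mb} MB"
```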
# Tighter thresholds for smoother I/O
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# Monitor current dirty page status
grep -E "Dirty|Writeback" /proc/meminfo
Overcommit: vm.overcommit_memory
Linux allows processes to allocate more virtual memory than physical memory plus swap (overcommit). Three modes control this:
- 0 (heuristic): the kernel uses heuristics to decide whether to allow an allocation. This is the default. It allows some overcommit, then invokes the OOM killer when physical memory runs out.
- 1 (always): always allow allocations; the kernel never refuses a request on accounting grounds. Redis needs this for BGSAVE: fork() nominally requests the full virtual address space, even though copy-on-write means almost none of it is physically copied.
- 2 (strict): the kernel tracks committed memory. Total allocations cannot exceed swap + (overcommit_ratio% of RAM). Allocations that would exceed the limit get ENOMEM immediately.
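Under mode 2 the commit limit works out as swap + overcommit_ratio% of RAM. For example, a 128 GB host with 8 GB of swap and ratio=80:

```shell
ram_gb=128; swap_gb=8; ratio=80
limit_gb=$(( swap_gb + ram_gb * ratio / 100 ))
echo "commit limit: ${limit_gb} GB"   # 110 GB -- allocations beyond this get ENOMEM
```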
# Strict accounting for database hosts (except Redis)
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
# Always-allow for dedicated Redis hosts
sysctl -w vm.overcommit_memory=1
Filesystem Tuning
File Descriptor Limits: fs.file-max and fs.nr_open
Linux tracks open file descriptors at two levels:
- fs.file-max: system-wide limit on total open file descriptors across all processes
- fs.nr_open: per-process ceiling (the maximum value ulimit -n can be set to)
The per-process soft and hard limits in /etc/security/limits.conf or systemd's LimitNOFILE are capped by fs.nr_open. The system-wide total is capped by fs.file-max.
# Check current usage vs limits
cat /proc/sys/fs/file-nr # allocated free max
# Raise for container hosts
sysctl -w fs.file-max=2097152
On a container host with 200 services, each container might hold 500-2000 open file descriptors (sockets, logs, unix domain sockets, epoll instances). The aggregate easily exceeds the default fs.file-max.
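Worst-case arithmetic for that host, using the upper end of the per-container estimate:

```shell
containers=200; fds_per_container=2000
agg=$(( containers * fds_per_container ))
echo "aggregate fds: $agg"   # 400000 -- before counting host daemons and kernel users
```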
Making Changes Persistent
Runtime changes via sysctl -w are lost on reboot. Persistent configuration goes in /etc/sysctl.d/:
cat > /etc/sysctl.d/99-production.conf << 'EOF'
# Network
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 8000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
# Memory
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
# Filesystem
fs.file-max = 2097152
EOF
# Apply without rebooting
sysctl --system
Files in /etc/sysctl.d/ are loaded in lexicographic order by systemd-sysctl.service at boot. The 99- prefix ensures site-specific settings override distribution defaults (which typically use lower-numbered prefixes like 10- or 50-).
Key Sysctl Reference Table
| Sysctl | Subsystem | Default | Recommended (Web Server) | Recommended (Database) | Recommended (Container Host) |
|---|---|---|---|---|---|
| net.core.somaxconn | Network | 128 | 65535 | 65535 | 65535 |
| net.ipv4.tcp_max_syn_backlog | Network | 128-1024 | 65535 | 65535 | 65535 |
| net.ipv4.tcp_rmem (max) | Network | ~6 MB | 32 MB | 16 MB | 16 MB |
| net.ipv4.tcp_wmem (max) | Network | ~4 MB | 32 MB | 16 MB | 16 MB |
| net.core.netdev_budget | Network | 300 | 600 | 300 | 600 |
| net.ipv4.ip_local_port_range | Network | 32768-60999 | 1024-65535 | 32768-60999 | 1024-65535 |
| net.ipv4.tcp_tw_reuse | Network | 0 | 1 | 0 | 1 |
| vm.swappiness | Memory | 60 | 10 | 10 | 10 |
| vm.dirty_ratio | Memory | 20 | 10 | 10 | 20 |
| vm.dirty_background_ratio | Memory | 10 | 5 | 5 | 10 |
| vm.overcommit_memory | Memory | 0 | 0 | 2 | 0 |
| vm.overcommit_ratio | Memory | 50 | 50 | 80 | 50 |
| fs.file-max | Filesystem | ~800K | 2097152 | 2097152 | 2097152 |
| fs.nr_open | Filesystem | 1048576 | 1048576 | 1048576 | 1048576 |
Containers and Namespace Isolation
Many sysctls are namespace-aware. Network sysctls like somaxconn and tcp_rmem are per-network-namespace. A container with its own network namespace starts from the kernel defaults for those parameters, not from the host's tuned values.
Setting sysctls on the host does not propagate into containers. Instead, configure them through the container runtime:
# Docker
docker run --sysctl net.core.somaxconn=65535 --sysctl net.ipv4.tcp_rmem="4096 87380 33554432" myapp
# Kubernetes (pod spec)
# securityContext.sysctls in the pod spec for safe sysctls
# unsafe sysctls require kubelet --allowed-unsafe-sysctls flag
Non-namespaced sysctls (vm.swappiness, fs.file-max) are global and shared across all containers on the host. Changing them affects every process on the system.
Common Questions
How does the kernel decide the default value for fs.file-max?
At boot, the kernel calculates fs.file-max based on available RAM, roughly 10% of memory pages. On a 64 GB host with 4 KB pages, that is about 1.6 million. On a 4 GB host, about 100K. The formula is intentionally conservative. Production hosts with many services or containers should set an explicit value.
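A rough version of that boot-time sizing, assuming 4 KB pages (the real kernel formula varies slightly across versions):

```shell
ram_bytes=$(( 64 * 1024 * 1024 * 1024 ))   # 64 GB host
pages=$(( ram_bytes / 4096 ))
approx=$(( pages / 10 ))                   # ~10% of memory pages
echo "approx fs.file-max: $approx"         # ~1.6 million
```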
What happens when somaxconn is exhausted?
New SYN packets are dropped silently. The kernel increments the ListenOverflows and ListenDrops counters (visible in netstat -s or /proc/net/netstat). The client retries with exponential backoff (1s, 2s, 4s...), causing user-visible latency. If tcp_syncookies is enabled (default on most distributions), the kernel falls back to SYN cookies, which allow connections without queue space but disable TCP options like window scaling and SACK.
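The client-side backoff adds up quickly. With the Linux default of net.ipv4.tcp_syn_retries=6, retransmissions go out at roughly 1, 2, 4, 8, 16, and 32 second intervals:

```shell
total=0; delay=1
for i in 1 2 3 4 5 6; do
    total=$(( total + delay ))   # wait before each retransmit
    delay=$(( delay * 2 ))
done
echo "backoff before the last retry: ${total} s"   # 63 s; with the final wait, connect() gives up after roughly two minutes
```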
Should vm.overcommit_memory be set to 2 everywhere?
No. Strict accounting (mode 2) prevents the OOM killer but also prevents legitimate uses of sparse virtual memory. Redis fork(), Java heap reservation, and mmap of large files all request virtual memory far exceeding physical usage. Mode 2 rejects these allocations unless the overcommit_ratio is set high enough to cover the virtual memory requests. Use mode 2 on database hosts where predictable failure is better than OOM kills. Use mode 1 on dedicated Redis hosts. Leave mode 0 on general-purpose servers.
Why do sysctl changes inside a container not affect the host?
Namespaced sysctls are isolated per namespace. A container in its own network namespace has its own copy of somaxconn, tcp_rmem, and related parameters. Changes inside the container affect only that container's namespace. Non-namespaced sysctls (vm.*, fs.file-max) are shared, and changing them inside a privileged container does affect the host -- which is a security concern.
How Technologies Use This
An Nginx reverse proxy behind an AWS NLB handles 100,000 concurrent WebSocket connections. During a Monday morning traffic spike, the connection error rate jumps from 0.01% to 12%. The host has 64 GB of RAM, 32 cores, and a 10 Gbps NIC, none of which are saturated. Packet captures on the NIC show SYN packets arriving but never completing the handshake.
The kernel's net.core.somaxconn defaults to 128, which caps the listen() backlog queue. When Nginx calls listen(fd, 511), the kernel silently truncates the backlog to 128. Incoming SYN packets that arrive faster than Nginx can accept() fill this 128-slot queue, and the kernel drops subsequent SYNs without sending RST. The client retries after 1 second, then 2, then 4, creating the visible timeout cascade. Meanwhile, net.ipv4.tcp_max_syn_backlog controls the separate SYN queue (half-open connections) and defaults to 1024, which also overflows during the burst.
Setting somaxconn=65535 and tcp_max_syn_backlog=65535 via sysctl removes the bottleneck. The kernel now allows Nginx's requested backlog of 511 (or higher if configured) and queues up to 65,535 half-open connections. Combined with raising tcp_rmem and tcp_wmem max values to 33,554,432 bytes for per-socket buffer auto-tuning, the same host handles 120,000 concurrent connections at sub-millisecond proxy latency. The tuning cost is roughly 2 GB of additional kernel memory for socket buffers.
A PostgreSQL 15 instance managing a 200 GB analytics database on a 128 GB host triggers the OOM killer during a nightly batch import of 45 million rows. The OOM killer selects the postmaster process, bringing down the entire database. The host has 128 GB of physical RAM, 8 GB of swap, and vm.overcommit_memory set to the default value of 0. Monitoring shows that physical memory usage peaked at 118 GB, meaning the kernel chose to kill PostgreSQL while 10 GB of RAM was still technically free.
The kernel's heuristic overcommit mode (vm.overcommit_memory=0) guesses whether an allocation will succeed. During the batch import, PostgreSQL's work_mem allocations for sorting and hash joins push the committed memory total past the kernel's internal threshold. The kernel allows these allocations until physical memory is nearly exhausted, then the OOM killer activates and picks the highest-RSS process. Setting vm.overcommit_memory=2 with vm.overcommit_ratio=80 switches to strict accounting. The commit limit becomes swap + 80% of RAM (110 GB). Any allocation that would exceed this limit gets a clean ENOMEM, which PostgreSQL handles gracefully by aborting the query rather than dying.
Checkpoint I/O storms are the second sysctl problem. PostgreSQL checkpoints flush dirty shared buffers to disk. With vm.dirty_ratio at the default 20, the kernel allows up to 25 GB of dirty pages before forcing synchronous writeback. When the checkpoint flushes all at once, the NVMe drive saturates and query latency spikes to 500ms. Setting vm.dirty_ratio=10 and vm.dirty_background_ratio=5 starts background writeback earlier and spreads the I/O over time. Combined with lowering vm.swappiness from 60 to 10, the kernel strongly prefers evicting filesystem page cache over swapping PostgreSQL's shared_buffers pages, preventing the catastrophic latency that occurs when heap pages land on swap.
A Redis 7 instance holding a 28 GB dataset on a 64 GB host experiences 200ms latency spikes every time BGSAVE runs. Normal GET/SET latency is 0.1ms. The spike correlates precisely with the fork() that BGSAVE uses to create a point-in-time snapshot. Transparent Huge Pages (THP) are enabled at the default kernel setting, and vm.swappiness is at the default value of 60.
When Redis forks for BGSAVE, the parent and child share all 28 GB of memory via copy-on-write (COW). With THP enabled, the kernel attempts to maintain 2 MB huge pages. Each write to a COW page triggers a page fault, and THP forces the kernel to copy a full 2 MB page instead of a standard 4 KB page. On a dataset with scattered writes, this means each modified key copies 512 times more memory than necessary. The kernel's khugepaged compaction thread also contends for the mmap_lock, adding further latency. Disabling THP via "echo never > /sys/kernel/mm/transparent_hugepage/enabled" eliminates the spike entirely. Redis documents this as a required tuning step.
The vm.swappiness=60 default tells the kernel to balance between reclaiming page cache and swapping anonymous pages. For Redis, anonymous pages are the dataset itself. Swapping even 100 MB of Redis data to disk causes GET latency to jump from 0.1ms to 10ms when those keys are accessed. Setting vm.swappiness=1 tells the kernel to avoid swapping anonymous pages unless the system is nearly out of memory. Combined with vm.overcommit_memory=1, which ensures fork() always succeeds for BGSAVE regardless of the virtual memory size request, these three sysctl changes transform Redis from unreliable under memory pressure to stable at consistent sub-millisecond latency.
Same Concept Across Tech
| Technology | Key sysctls | Why it matters |
|---|---|---|
| Nginx/HAProxy | somaxconn=65535, tcp_rmem/tcp_wmem max=32M, netdev_budget=600 | Default backlog of 128 drops connections during traffic spikes. Buffer limits cap throughput on high-latency links. |
| PostgreSQL | swappiness=10, dirty_ratio=10, dirty_background_ratio=5, overcommit_memory=2 | Swapping database pages causes 100x latency spikes. Bursty writeback stalls checkpoint I/O. Overcommit heuristic leads to OOM kills instead of clean ENOMEM. |
| Redis | overcommit_memory=1, swappiness=10, somaxconn=65535 | fork() for BGSAVE requests full virtual address space. Overcommit heuristic rejects the fork despite sufficient physical RAM. |
| Docker/K8s | file-max=2M+, somaxconn=65535, ip_local_port_range=1024-65535, max_map_count=262144 | Hundreds of containers exhaust system-wide file descriptors and ephemeral ports. Elasticsearch requires high max_map_count. |
| MySQL/MariaDB | swappiness=1, dirty_ratio=10, file-max=2M, somaxconn=65535 | InnoDB buffer pool must stay in RAM. Connection storms fill the backlog. |
Stack layer mapping (connections dropped under load):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the listen backlog configured higher than somaxconn? | Application config (nginx: backlog directive) |
| Socket | Is the listen queue filling up? | ss -lnt (Recv-Q approaching Send-Q) |
| Kernel network | Is somaxconn or tcp_max_syn_backlog the bottleneck? | sysctl net.core.somaxconn |
| Kernel memory | Is vm.overcommit rejecting allocations or triggering OOM? | dmesg, /proc/meminfo, sysctl vm.overcommit_memory |
| Kernel filesystem | Is file-max exhausted? | cat /proc/sys/fs/file-nr |
| Hardware | Is the NIC dropping packets before the kernel sees them? | ethtool -S eth0 |
Design Rationale
The kernel cannot know at boot whether it will run a single-user desktop, a 100K-connection web proxy, or a 48-core database server. Conservative defaults prevent a misconfigured system from exhausting memory or file descriptors on first boot. sysctl provides the escape hatch: a runtime interface to every tunable parameter, organized by subsystem, writable without recompilation, and persistable across reboots via /etc/sysctl.d/ drop-in files. The trade-off is that operators must know which dials to turn -- and the defaults are wrong for almost every production workload.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| dmesg "possible SYN flooding" | somaxconn or tcp_max_syn_backlog too low | sysctl net.core.somaxconn and ss -lnt |
| Connection timeouts under load | Listen backlog full, new SYNs dropped | ss -lnt (Recv-Q near Send-Q), dmesg |
| "Too many open files" across containers | fs.file-max exhausted system-wide | cat /proc/sys/fs/file-nr |
| OOM killer strikes during batch jobs | vm.overcommit_memory=0 allows overcommit then kills | sysctl vm.overcommit_memory, dmesg |
| Redis BGSAVE "Cannot allocate memory" | Overcommit heuristic rejects fork() virtual memory request | sysctl vm.overcommit_memory (set to 1 for Redis) |
| Latency spikes during checkpoint/fsync | vm.dirty_ratio too high, causing bursty writeback | sysctl vm.dirty_ratio vm.dirty_background_ratio |
| Swap usage on database host with free RAM | vm.swappiness too high, kernel swapping anonymous pages | sysctl vm.swappiness (lower to 10) |
| Throughput capped below NIC line rate | tcp_rmem/tcp_wmem max too low or netdev_budget too low | sysctl net.ipv4.tcp_rmem and net.core.netdev_budget |
When to Use / Avoid
Relevant when:
- An application drops connections despite having spare CPU, memory, and bandwidth
- dmesg shows "possible SYN flooding" on a server that is not under attack
- A database host triggers the OOM killer during routine operations
- Container hosts report "too many open files" across multiple containers
- Writeback I/O is bursty, causing latency spikes during database checkpoints
Watch out for:
- Network namespace isolation means host-level sysctl changes do not propagate into containers automatically
- Raising buffer maximums without understanding total memory impact across thousands of connections
- Setting vm.swappiness=0 tells the kernel to avoid swap almost entirely, so memory pressure triggers the OOM killer earlier instead of spilling to swap
- Tuning values from blog posts without verifying they match the workload and hardware
Try It Yourself
# View all current sysctl values for a subsystem
sysctl -a | grep net.core

# Apply a sysctl change immediately
sysctl -w net.core.somaxconn=65535

# Create a persistent sysctl configuration file
cat > /etc/sysctl.d/99-production.conf << 'CONF'
# Network tuning
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 8000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1

# Memory tuning
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 80

# Filesystem tuning
fs.file-max = 2097152
fs.nr_open = 1048576
CONF

# Apply all persistent sysctl files without rebooting
sysctl --system

# Check listen backlog usage on all listening sockets
ss -lnt | column -t

# Monitor file descriptor usage system-wide
cat /proc/sys/fs/file-nr

# Check current dirty page status
grep -E "Dirty|Writeback" /proc/meminfo

# Verify a sysctl survives reboot (check persistent config)
grep somaxconn /etc/sysctl.d/*.conf

# Compare container vs host sysctl values
docker run --rm alpine sysctl net.core.somaxconn
Debug Checklist
1. Check current somaxconn: sysctl net.core.somaxconn
2. Check listen backlog usage: ss -lnt | column -t
3. Check SYN flood warnings: dmesg | grep -i 'syn flood'
4. Check file descriptor usage: cat /proc/sys/fs/file-nr
5. Check current swappiness: sysctl vm.swappiness
6. Check dirty page thresholds: sysctl vm.dirty_ratio vm.dirty_background_ratio
7. Check overcommit policy: sysctl vm.overcommit_memory vm.overcommit_ratio
8. Check per-process fd limits: ulimit -n (soft) and ulimit -Hn (hard)
9. Verify persistent config: sysctl --system && sysctl -a | grep <param>
Key Takeaways
- ✓ sysctl changes via the command line or /proc/sys writes take effect immediately but are lost on reboot. Persistent changes go in /etc/sysctl.d/*.conf files and are applied at boot by systemd-sysctl.service. Always do both: apply now with sysctl -w and persist in a conf file.
- ✓ Network tuning has three layers: global limits (somaxconn, netdev_budget), protocol defaults (tcp_rmem, tcp_wmem, tcp_max_syn_backlog), and per-socket overrides (SO_RCVBUF, SO_SNDBUF via setsockopt). The kernel auto-tunes per-socket buffers within the min/default/max range set by tcp_rmem and tcp_wmem. Setting SO_RCVBUF explicitly disables auto-tuning for that socket.
- ✓ vm.swappiness does not control whether swapping happens. It controls the relative weight the kernel gives to reclaiming anonymous pages (swap) versus page cache. At swappiness=0, the kernel avoids swapping almost entirely and prefers dropping page cache. At swappiness=100, anonymous and page cache reclaim are weighted equally. Database hosts typically use swappiness=10 to protect heap pages from being swapped.
- ✓ vm.dirty_ratio and vm.dirty_background_ratio control when dirty page writeback happens. background_ratio triggers the background flusher threads (formerly pdflush, now the writeback workers). dirty_ratio is the hard limit where a writing process is forced to do synchronous writeback and blocks. Setting background_ratio too high causes bursty I/O; setting dirty_ratio too low causes frequent process stalls.
- ✓ fs.file-max sets the system-wide limit on open file descriptors. fs.nr_open sets the ceiling for per-process limits (what ulimit -n can be raised to). The per-process soft/hard limits in /etc/security/limits.conf or systemd LimitNOFILE are capped by nr_open. A common mistake is raising only ulimit while leaving file-max or nr_open at defaults.
Common Pitfalls
- ✗ Raising the Nginx or HAProxy backlog directive without also raising net.core.somaxconn. The kernel clamps the listen backlog to somaxconn, so setting backlog=65535 in Nginx while somaxconn=128 means the actual backlog is 128. Both must be raised together.
- ✗ Setting SO_RCVBUF or SO_SNDBUF explicitly in application code and wondering why tcp_rmem/tcp_wmem changes have no effect. Explicit setsockopt calls disable the kernel auto-tuning for that socket. Remove the explicit setsockopt and let the kernel auto-tune within the tcp_rmem/tcp_wmem range instead.
- ✗ Writing sysctl values only to /proc/sys and forgetting to persist them in /etc/sysctl.d/. After a reboot or kernel update, all values revert to defaults. The production incident repeats, typically at the worst possible time.
- ✗ Setting vm.swappiness=0 on a host that needs swap as a safety net. swappiness=0 tells the kernel to avoid swapping almost entirely, which means the OOM killer activates sooner when memory pressure hits. On database hosts, swappiness=10 is usually the right balance: swap is available as a last resort but the kernel strongly prefers dropping page cache.
- ✗ Increasing tcp_rmem and tcp_wmem max values without considering total memory impact. With 100K connections and a 32 MB max buffer, the theoretical worst case is 3.2 TB of buffer memory. The kernel auto-tuning prevents this in practice, but applications that set large SO_RCVBUF values explicitly bypass auto-tuning and can exhaust memory.
- ✗ Tuning sysctls in the host namespace and expecting them to apply inside containers. Many network sysctls are per-network-namespace. Containers with their own network namespace inherit the defaults, not the host values. Use sysctl settings in the container runtime configuration (docker run --sysctl, Kubernetes securityContext.sysctls) instead.
Reference
In One Line
Default kernel settings cap listen backlogs at 128, limit socket buffers to 6 MB, and let dirty pages pile up -- sysctl adjusts these dials to match production workloads without recompiling anything.