Sysctl Tuning Reference
Mental Model
A factory ships with every dial set to "safe for a small workshop." The dials control how many people can line up at the door (somaxconn), how wide the conveyor belts are (tcp buffer sizes), how aggressively the janitor sweeps (dirty page writeback), and how many toolboxes can be open at once (file-max). Running a warehouse-scale operation on workshop settings means the dials become the bottleneck long before the building does. sysctl is the wrench for adjusting every dial.
The Problem
An Nginx server starts dropping connections under load. dmesg shows "TCP: request_sock_TCP: Possible SYN flooding on port 443" because net.core.somaxconn is still at the default 128. The listen backlog fills instantly during traffic spikes, and the kernel silently discards new SYN packets. Clients see connection timeouts. The application logs show nothing because the rejection happens before Nginx ever sees the connection. Increasing the Nginx backlog directive alone does not help -- the kernel clamps it to somaxconn.
Architecture
A fresh Linux installation ships with kernel parameters tuned for a modest general-purpose workload. These defaults are deliberately conservative: a small listen backlog, moderate socket buffers, eager swapping, and filesystem limits scaled to prevent a misconfigured box from exhausting resources on first boot.
Production servers are not general-purpose. A web proxy handling 100K concurrent connections, a database managing 48 GB of hot data, or a container host running 200 microservices all need different kernel parameters. The gap between default settings and production requirements is where sysctl lives.
The Problem with Defaults
Here is what a stock kernel looks like on parameters that matter:
| Parameter | Default | Production Reality |
|---|---|---|
| net.core.somaxconn | 128 | Nginx requests backlog of 511; kernel clamps it to 128 |
| net.ipv4.tcp_rmem (max) | ~6 MB | Insufficient for high-BDP links (1 Gbps * 50ms = 6.25 MB) |
| net.ipv4.tcp_wmem (max) | ~4 MB | Same BDP problem on the send side |
| net.core.netdev_budget | 300 | 10 Gbps NIC delivers 1M+ pps; 300 packets per poll cycle means constant re-scheduling |
| vm.swappiness | 60 | Kernel happily swaps database heap pages to make room for page cache |
| vm.dirty_ratio | 20 | 20% of RAM as dirty pages means a 64 GB host can accumulate 12.8 GB before stalling |
| vm.overcommit_memory | 0 | Heuristic mode: allows overcommit, then OOM-kills when memory runs out |
| fs.file-max | ~800K (varies) | 200 containers with sockets, logs, and IPC channels burn through this |
Every one of these defaults is reasonable for a laptop. Every one of them is wrong for a production server under real load.
Network Tuning
Listen Backlog: somaxconn and tcp_max_syn_backlog
When a client sends a SYN packet, the kernel places the half-open connection in the SYN queue. After the three-way handshake completes, the connection moves to the accept queue. net.core.somaxconn is the ceiling for the accept queue. net.ipv4.tcp_max_syn_backlog is the ceiling for the SYN queue.
The default somaxconn of 128 means every listen() call is clamped to a 128-entry backlog, regardless of what the application requests. During a traffic spike, the queue fills in milliseconds, and subsequent SYN packets are silently dropped. The kernel logs "TCP: request_sock_TCP: Possible SYN flooding" in dmesg, but the application never sees the dropped connections.
# Check current values
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
# Check if backlog is filling up (Recv-Q approaching Send-Q)
ss -lnt | column -t
# Raise both to handle production traffic
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
Socket Buffers: tcp_rmem and tcp_wmem
These parameters control the kernel's auto-tuning range for TCP receive and send buffers. Each takes three values: minimum, default, and maximum (in bytes).
The kernel dynamically adjusts each socket's buffer within this range based on available memory and connection behavior. The default maximum of ~6 MB limits throughput on high-latency links: a 1 Gbps link with 50ms RTT has a bandwidth-delay product of 6.25 MB, and the receiver cannot advertise a window large enough to fill the pipe.
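The bandwidth-delay product figure above can be checked directly with shell arithmetic (numbers taken from the text: 1 Gbps link, 50 ms RTT):

```shell
# Bandwidth-delay product = link rate (bytes/s) * RTT (s)
# 1 Gbps = 1,000,000,000 bits/s = 125,000,000 bytes/s; RTT = 50 ms
bdp=$(( 125000000 * 50 / 1000 ))
echo "BDP: $bdp bytes"   # 6250000 bytes, i.e. 6.25 MB -- right at the default tcp_rmem max
```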
# View current auto-tuning ranges
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# Set wider ranges for high-throughput servers
sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"
The minimum (4096) prevents degenerate buffer sizes. The default (87380/65536) is the starting point for new connections. The maximum (33554432 = 32 MB) is the ceiling the auto-tuner can reach. The kernel will not actually allocate 32 MB per socket unless the connection needs it.
NAPI Polling: netdev_budget
Network Interface Cards deliver packets via hardware interrupts. The NAPI subsystem converts these to polled mode: after the first interrupt, the kernel polls the NIC for more packets without triggering additional interrupts. net.core.netdev_budget controls how many packets the kernel processes per poll cycle.
The default of 300 is too low for 10 Gbps NICs. At 1 million packets per second, processing only 300 per cycle means re-scheduling softirq over 3,000 times per second, adding latency and CPU overhead.
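A quick sanity check on that figure:

```shell
pps=1000000   # packets per second from a busy 10 Gbps NIC
budget=300    # default net.core.netdev_budget
cycles=$(( pps / budget ))
echo "poll cycles/sec: $cycles"   # 3333 -- the softirq is re-scheduled thousands of times per second
```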
# Raise the per-cycle packet budget and the time allowance for each cycle
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=8000
Memory Tuning
Swap Behavior: vm.swappiness
This parameter is widely misunderstood. It does not control whether swapping happens. It controls the relative priority the kernel gives to reclaiming anonymous pages (swapping) versus reclaiming page cache.
- swappiness=60 (default): the kernel is fairly willing to swap, though it still weights page-cache reclaim more heavily
- swappiness=10: strongly prefer dropping page cache; swap only under significant pressure
- swappiness=0: avoid swapping almost entirely; only swap to prevent OOM
Database hosts should use swappiness=10. Swapping a PostgreSQL shared buffer page or a Redis key to disk causes latency spikes of 10-100ms per page fault. Dropping page cache is almost always the better trade-off, since the database has its own caching layer.
sysctl -w vm.swappiness=10
Dirty Page Writeback: vm.dirty_ratio and vm.dirty_background_ratio
When a process writes to a file, the data lands in page cache and the page is marked dirty. Two thresholds control when dirty pages are flushed to disk:
- vm.dirty_background_ratio: when dirty pages exceed this percentage of total RAM, background flusher threads start writeback (non-blocking)
- vm.dirty_ratio: when dirty pages exceed this percentage, the writing process itself is forced into synchronous writeback (blocking)
With the defaults (dirty_background_ratio=10, dirty_ratio=20), a 64 GB host can accumulate 12.8 GB of dirty data before any process stalls. On a database host, this means a checkpoint can cause a sudden I/O storm as gigabytes of dirty pages are written at once.
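The thresholds translate into absolute sizes like this (64 GB host, default ratios, integer MB arithmetic):

```shell
ram_mb=$(( 64 * 1024 ))
bg_mb=$(( ram_mb * 10 / 100 ))     # background flushing starts: 6553 MB (~6.4 GB)
hard_mb=$(( ram_mb * 20 / 100 ))   # writers block: 13107 MB (~12.8 GB)
echo "background: ${bg_mb} MB, blocking: ${hard_mb} MB"
```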
# Tighter thresholds for smoother I/O
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# Monitor current dirty page status
grep -E "Dirty|Writeback" /proc/meminfo
Overcommit: vm.overcommit_memory
Linux allows processes to allocate more virtual memory than physical memory plus swap (overcommit). Three modes control this:
- 0 (heuristic): the kernel uses heuristics to decide whether to allow an allocation. This is the default. It allows some overcommit, then invokes the OOM killer when physical memory runs out.
- 1 (always): always allow allocations; the kernel never refuses a request on accounting grounds. Redis needs this for BGSAVE: fork() nominally requests the full virtual address space, even though copy-on-write means almost none of it is physically copied.
- 2 (strict): the kernel tracks committed memory. Total allocations cannot exceed swap + (overcommit_ratio% of RAM). Allocations that would exceed the limit get ENOMEM immediately.
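Under mode 2 the commit limit works out as swap + overcommit_ratio% of RAM. For example, a 128 GB host with 8 GB of swap and ratio=80:

```shell
ram_gb=128; swap_gb=8; ratio=80
limit_gb=$(( swap_gb + ram_gb * ratio / 100 ))
echo "commit limit: ${limit_gb} GB"   # 110 GB -- allocations beyond this get ENOMEM
```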
# Strict accounting for database hosts (except Redis)
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
# Always-allow for dedicated Redis hosts
sysctl -w vm.overcommit_memory=1
Filesystem Tuning
File Descriptor Limits: fs.file-max and fs.nr_open
Linux tracks open file descriptors at two levels:
- fs.file-max: system-wide limit on total open file descriptors across all processes
- fs.nr_open: per-process ceiling (the maximum value ulimit -n can be set to)
The per-process soft and hard limits in /etc/security/limits.conf or systemd's LimitNOFILE are capped by fs.nr_open. The system-wide total is capped by fs.file-max.
# Check current usage vs limits
cat /proc/sys/fs/file-nr # allocated free max
# Raise for container hosts
sysctl -w fs.file-max=2097152
On a container host with 200 services, each container might hold 500-2000 open file descriptors (sockets, logs, unix domain sockets, epoll instances). The aggregate easily exceeds the default fs.file-max.
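Worst-case arithmetic for that host, using the upper end of the per-container estimate:

```shell
containers=200; fds_per_container=2000
agg=$(( containers * fds_per_container ))
echo "aggregate fds: $agg"   # 400000 -- before counting host daemons and kernel users
```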
Making Changes Persistent
Runtime changes via sysctl -w are lost on reboot. Persistent configuration goes in /etc/sysctl.d/:
cat > /etc/sysctl.d/99-production.conf << 'EOF'
# Network
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 8000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
# Memory
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
# Filesystem
fs.file-max = 2097152
EOF
# Apply without rebooting
sysctl --system
Files in /etc/sysctl.d/ are loaded in lexicographic order by systemd-sysctl.service at boot. The 99- prefix ensures site-specific settings override distribution defaults (which typically use lower-numbered prefixes like 10- or 50-).
Key Sysctl Reference Table
| Sysctl | Subsystem | Default | Recommended (Web Server) | Recommended (Database) | Recommended (Container Host) |
|---|---|---|---|---|---|
| net.core.somaxconn | Network | 128 | 65535 | 65535 | 65535 |
| net.ipv4.tcp_max_syn_backlog | Network | 128-1024 | 65535 | 65535 | 65535 |
| net.ipv4.tcp_rmem (max) | Network | ~6 MB | 32 MB | 16 MB | 16 MB |
| net.ipv4.tcp_wmem (max) | Network | ~4 MB | 32 MB | 16 MB | 16 MB |
| net.core.netdev_budget | Network | 300 | 600 | 300 | 600 |
| net.ipv4.ip_local_port_range | Network | 32768-60999 | 1024-65535 | 32768-60999 | 1024-65535 |
| net.ipv4.tcp_tw_reuse | Network | 0 | 1 | 0 | 1 |
| vm.swappiness | Memory | 60 | 10 | 10 | 10 |
| vm.dirty_ratio | Memory | 20 | 10 | 10 | 20 |
| vm.dirty_background_ratio | Memory | 10 | 5 | 5 | 10 |
| vm.overcommit_memory | Memory | 0 | 0 | 2 | 0 |
| vm.overcommit_ratio | Memory | 50 | 50 | 80 | 50 |
| fs.file-max | Filesystem | ~800K | 2097152 | 2097152 | 2097152 |
| fs.nr_open | Filesystem | 1048576 | 1048576 | 1048576 | 1048576 |
Containers and Namespace Isolation
Many sysctls are namespace-aware. Network sysctls like somaxconn and tcp_rmem are per-network-namespace. A container with its own network namespace starts from the kernel defaults for those parameters, not from the host's tuned values.
Setting sysctls on the host does not propagate into containers. Instead, configure them through the container runtime:
# Docker
docker run --sysctl net.core.somaxconn=65535 --sysctl net.ipv4.tcp_rmem="4096 87380 33554432" myapp
# Kubernetes (pod spec)
# securityContext.sysctls in the pod spec for safe sysctls
# unsafe sysctls require kubelet --allowed-unsafe-sysctls flag
Non-namespaced sysctls (vm.swappiness, fs.file-max) are global and shared across all containers on the host. Changing them affects every process on the system.
Common Questions
How does the kernel decide the default value for fs.file-max?
At boot, the kernel calculates fs.file-max based on available RAM, roughly 10% of memory pages. On a 64 GB host with 4 KB pages, that is about 1.6 million. On a 4 GB host, about 100K. The formula is intentionally conservative. Production hosts with many services or containers should set an explicit value.
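A rough version of that boot-time sizing, assuming 4 KB pages (the real kernel formula varies slightly across versions):

```shell
ram_bytes=$(( 64 * 1024 * 1024 * 1024 ))   # 64 GB host
pages=$(( ram_bytes / 4096 ))
approx=$(( pages / 10 ))                   # ~10% of memory pages
echo "approx fs.file-max: $approx"         # ~1.6 million
```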
What happens when somaxconn is exhausted?
New SYN packets are dropped silently. The kernel increments the ListenOverflows and ListenDrops counters (visible in netstat -s or /proc/net/netstat). The client retries with exponential backoff (1s, 2s, 4s...), causing user-visible latency. If tcp_syncookies is enabled (default on most distributions), the kernel falls back to SYN cookies, which allow connections without queue space but disable TCP options like window scaling and SACK.
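The client-side backoff adds up quickly. With the Linux default of net.ipv4.tcp_syn_retries=6, retransmissions go out at roughly 1, 2, 4, 8, 16, and 32 second intervals:

```shell
total=0; delay=1
for i in 1 2 3 4 5 6; do
    total=$(( total + delay ))   # wait before each retransmit
    delay=$(( delay * 2 ))
done
echo "backoff before the last retry: ${total} s"   # 63 s; with the final wait, connect() gives up after roughly two minutes
```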
Should vm.overcommit_memory be set to 2 everywhere?
No. Strict accounting (mode 2) prevents the OOM killer but also prevents legitimate uses of sparse virtual memory. Redis fork(), Java heap reservation, and mmap of large files all request virtual memory far exceeding physical usage. Mode 2 rejects these allocations unless the overcommit_ratio is set high enough to cover the virtual memory requests. Use mode 2 on database hosts where predictable failure is better than OOM kills. Use mode 1 on dedicated Redis hosts. Leave mode 0 on general-purpose servers.
Why do sysctl changes inside a container not affect the host?
Namespaced sysctls are isolated per namespace. A container in its own network namespace has its own copy of somaxconn, tcp_rmem, and related parameters. Changes inside the container affect only that container's namespace. Non-namespaced sysctls (vm.*, fs.file-max) are shared, and changing them inside a privileged container does affect the host -- which is a security concern.
How Technologies Use This
An Nginx reverse proxy behind an AWS NLB handles 100,000 concurrent WebSocket connections. During a Monday morning traffic spike, the connection error rate jumps from 0.01% to 12%. The host has 64 GB of RAM, 32 cores, and a 10 Gbps NIC, none of which are saturated. Packet captures on the NIC show SYN packets arriving but never completing the handshake.
The kernel's net.core.somaxconn defaults to 128, which caps the listen() backlog queue. When Nginx calls listen(fd, 511), the kernel silently truncates the backlog to 128. Incoming SYN packets that arrive faster than Nginx can accept() fill this 128-slot queue, and the kernel drops subsequent SYNs without sending RST. The client retries after 1 second, then 2, then 4, creating the visible timeout cascade. Meanwhile, net.ipv4.tcp_max_syn_backlog controls the separate SYN queue (half-open connections) and defaults to 1024, which also overflows during the burst.
Setting somaxconn=65535 and tcp_max_syn_backlog=65535 via sysctl removes the bottleneck. The kernel now allows Nginx's requested backlog of 511 (or higher if configured) and queues up to 65,535 half-open connections. Combined with raising tcp_rmem and tcp_wmem max values to 33,554,432 bytes for per-socket buffer auto-tuning, the same host handles 120,000 concurrent connections at sub-millisecond proxy latency. The tuning cost is roughly 2 GB of additional kernel memory for socket buffers.
A PostgreSQL 15 instance managing a 200 GB analytics database on a 128 GB host triggers the OOM killer during a nightly batch import of 45 million rows. The OOM killer selects the postmaster process, bringing down the entire database. The host has 128 GB of physical RAM, 8 GB of swap, and vm.overcommit_memory set to the default value of 0. Monitoring shows that physical memory usage peaked at 118 GB, meaning the kernel chose to kill PostgreSQL while 10 GB of RAM was still technically free.
The kernel's heuristic overcommit mode (vm.overcommit_memory=0) guesses whether an allocation will succeed. During the batch import, PostgreSQL's work_mem allocations for sorting and hash joins push the committed memory total past the kernel's internal threshold. The kernel allows these allocations until physical memory is nearly exhausted, then the OOM killer activates and picks the highest-RSS process. Setting vm.overcommit_memory=2 with vm.overcommit_ratio=80 switches to strict accounting. The commit limit becomes swap + 80% of RAM (110 GB). Any allocation that would exceed this limit gets a clean ENOMEM, which PostgreSQL handles gracefully by aborting the query rather than dying.
Checkpoint I/O storms are the second sysctl problem. PostgreSQL checkpoints flush dirty shared buffers to disk. With vm.dirty_ratio at the default 20, the kernel allows up to 25 GB of dirty pages before forcing synchronous writeback. When the checkpoint flushes all at once, the NVMe drive saturates and query latency spikes to 500ms. Setting vm.dirty_ratio=10 and vm.dirty_background_ratio=5 starts background writeback earlier and spreads the I/O over time. Combined with lowering vm.swappiness from 60 to 10, the kernel strongly prefers evicting filesystem page cache over swapping PostgreSQL's shared_buffers pages, preventing the catastrophic latency that occurs when heap pages land on swap.
A Redis 7 instance holding a 28 GB dataset on a 64 GB host experiences 200ms latency spikes every time BGSAVE runs. Normal GET/SET latency is 0.1ms. The spike correlates precisely with the fork() that BGSAVE uses to create a point-in-time snapshot. Transparent Huge Pages (THP) are enabled at the default kernel setting, and vm.swappiness is at the default value of 60.
When Redis forks for BGSAVE, the parent and child share all 28 GB of memory via copy-on-write (COW). With THP enabled, the kernel attempts to maintain 2 MB huge pages. Each write to a COW page triggers a page fault, and THP forces the kernel to copy a full 2 MB page instead of a standard 4 KB page. On a dataset with scattered writes, this means each modified key copies 512 times more memory than necessary. The kernel's khugepaged compaction thread also contends for the mmap_lock, adding further latency. Disabling THP via "echo never > /sys/kernel/mm/transparent_hugepage/enabled" eliminates the spike entirely. Redis documents this as a required tuning step.
The vm.swappiness=60 default tells the kernel to balance between reclaiming page cache and swapping anonymous pages. For Redis, anonymous pages are the dataset itself. Swapping even 100 MB of Redis data to disk causes GET latency to jump from 0.1ms to 10ms when those keys are accessed. Setting vm.swappiness=1 tells the kernel to avoid swapping anonymous pages unless the system is nearly out of memory. Combined with vm.overcommit_memory=1, which ensures fork() always succeeds for BGSAVE regardless of the virtual memory size request, these three sysctl changes transform Redis from unreliable under memory pressure to stable at consistent sub-millisecond latency.
Same Concept Across Tech
| Technology | Key sysctls | Why it matters |
|---|---|---|
| Nginx/HAProxy | somaxconn=65535, tcp_rmem/tcp_wmem max=32M, netdev_budget=600 | Default backlog of 128 drops connections during traffic spikes. Buffer limits cap throughput on high-latency links. |
| PostgreSQL | swappiness=10, dirty_ratio=10, dirty_background_ratio=5, overcommit_memory=2 | Swapping database pages causes 100x latency spikes. Bursty writeback stalls checkpoint I/O. Overcommit heuristic leads to OOM kills instead of clean ENOMEM. |
| Redis | overcommit_memory=1, swappiness=10, somaxconn=65535 | fork() for BGSAVE requests full virtual address space. Overcommit heuristic rejects the fork despite sufficient physical RAM. |
| Docker/K8s | file-max=2M+, somaxconn=65535, ip_local_port_range=1024-65535, max_map_count=262144 | Hundreds of containers exhaust system-wide file descriptors and ephemeral ports. Elasticsearch requires high max_map_count. |
| MySQL/MariaDB | swappiness=1, dirty_ratio=10, file-max=2M, somaxconn=65535 | InnoDB buffer pool must stay in RAM. Connection storms fill the backlog. |
Stack layer mapping (connections dropped under load):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the listen backlog configured higher than somaxconn? | Application config (nginx: backlog directive) |
| Socket | Is the listen queue filling up? | ss -lnt (Recv-Q approaching Send-Q) |
| Kernel network | Is somaxconn or tcp_max_syn_backlog the bottleneck? | sysctl net.core.somaxconn |
| Kernel memory | Is vm.overcommit rejecting allocations or triggering OOM? | dmesg, /proc/meminfo, sysctl vm.overcommit_memory |
| Kernel filesystem | Is file-max exhausted? | cat /proc/sys/fs/file-nr |
| Hardware | Is the NIC dropping packets before the kernel sees them? | ethtool -S eth0 |
Design Rationale
The kernel cannot know at boot whether it will run a single-user desktop, a 100K-connection web proxy, or a 48-core database server. Conservative defaults prevent a misconfigured system from exhausting memory or file descriptors on first boot. sysctl provides the escape hatch: a runtime interface to every tunable parameter, organized by subsystem, writable without recompilation, and persistable across reboots via /etc/sysctl.d/ drop-in files. The trade-off is that operators must know which dials to turn -- and the defaults are wrong for almost every production workload.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| dmesg "possible SYN flooding" | somaxconn or tcp_max_syn_backlog too low | sysctl net.core.somaxconn and ss -lnt |
| Connection timeouts under load | Listen backlog full, new SYNs dropped | ss -lnt (Recv-Q near Send-Q), dmesg |
| "Too many open files" across containers | fs.file-max exhausted system-wide | cat /proc/sys/fs/file-nr |
| OOM killer strikes during batch jobs | vm.overcommit_memory=0 allows overcommit then kills | sysctl vm.overcommit_memory, dmesg |
| Redis BGSAVE "Cannot allocate memory" | Overcommit heuristic rejects fork() virtual memory request | sysctl vm.overcommit_memory (set to 1 for Redis) |
| Latency spikes during checkpoint/fsync | vm.dirty_ratio too high, causing bursty writeback | sysctl vm.dirty_ratio vm.dirty_background_ratio |
| Swap usage on database host with free RAM | vm.swappiness too high, kernel swapping anonymous pages | sysctl vm.swappiness (lower to 10) |
| Throughput capped below NIC line rate | tcp_rmem/tcp_wmem max too low or netdev_budget too low | sysctl net.ipv4.tcp_rmem and net.core.netdev_budget |
When to Use / Avoid
Relevant when:
- An application drops connections despite having spare CPU, memory, and bandwidth
- dmesg shows "possible SYN flooding" on a server that is not under attack
- A database host triggers the OOM killer during routine operations
- Container hosts report "too many open files" across multiple containers
- Writeback I/O is bursty, causing latency spikes during database checkpoints
Watch out for:
- Network namespace isolation means host-level sysctl changes do not propagate into containers automatically
- Raising buffer maximums without understanding total memory impact across thousands of connections
- Setting vm.swappiness=0 tells the kernel to avoid swap almost entirely, so memory pressure triggers the OOM killer earlier instead of spilling to swap
- Tuning values from blog posts without verifying they match the workload and hardware
Try It Yourself
# View all current sysctl values for a subsystem
sysctl -a | grep net.core

# Apply a sysctl change immediately
sysctl -w net.core.somaxconn=65535

# Create a persistent sysctl configuration file
cat > /etc/sysctl.d/99-production.conf << 'CONF'
# Network tuning
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 8000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1

# Memory tuning
vm.swappiness = 10
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 80

# Filesystem tuning
fs.file-max = 2097152
fs.nr_open = 1048576
CONF

# Apply all persistent sysctl files without rebooting
sysctl --system

# Check listen backlog usage on all listening sockets
ss -lnt | column -t

# Monitor file descriptor usage system-wide
cat /proc/sys/fs/file-nr

# Check current dirty page status
grep -E "Dirty|Writeback" /proc/meminfo

# Verify a sysctl survives reboot (check persistent config)
grep somaxconn /etc/sysctl.d/*.conf

# Compare container vs host sysctl values
docker run --rm alpine sysctl net.core.somaxconn
Debug Checklist
1. Check current somaxconn: sysctl net.core.somaxconn
2. Check listen backlog usage: ss -lnt | column -t
3. Check SYN flood warnings: dmesg | grep -i 'syn flood'
4. Check file descriptor usage: cat /proc/sys/fs/file-nr
5. Check current swappiness: sysctl vm.swappiness
6. Check dirty page thresholds: sysctl vm.dirty_ratio vm.dirty_background_ratio
7. Check overcommit policy: sysctl vm.overcommit_memory vm.overcommit_ratio
8. Check per-process fd limits: ulimit -n (soft) and ulimit -Hn (hard)
9. Verify persistent config: sysctl --system && sysctl -a | grep <param>
Key Takeaways
- ✓ sysctl changes via the command line or /proc/sys writes take effect immediately but are lost on reboot. Persistent changes go in /etc/sysctl.d/*.conf files and are applied at boot by systemd-sysctl.service. Always do both: apply now with sysctl -w and persist in a conf file.
- ✓ Network tuning has three layers: global limits (somaxconn, netdev_budget), protocol defaults (tcp_rmem, tcp_wmem, tcp_max_syn_backlog), and per-socket overrides (SO_RCVBUF, SO_SNDBUF via setsockopt). The kernel auto-tunes per-socket buffers within the min/default/max range set by tcp_rmem and tcp_wmem. Setting SO_RCVBUF explicitly disables auto-tuning for that socket.
- ✓ vm.swappiness does not control whether swapping happens. It controls the relative weight the kernel gives to reclaiming anonymous pages (swap) versus page cache. At swappiness=0, the kernel avoids swapping almost entirely and prefers dropping page cache. At swappiness=100, anonymous and page cache reclaim are weighted equally. Database hosts typically use swappiness=10 to protect heap pages from being swapped.
- ✓ vm.dirty_ratio and vm.dirty_background_ratio control when dirty page writeback happens. background_ratio triggers the background flusher threads (formerly pdflush, now the writeback workers). dirty_ratio is the hard limit where a writing process is forced to do synchronous writeback and blocks. Setting background_ratio too high causes bursty I/O; setting dirty_ratio too low causes frequent process stalls.
- ✓ fs.file-max sets the system-wide limit on open file descriptors. fs.nr_open sets the ceiling for per-process limits (what ulimit -n can be raised to). The per-process soft/hard limits in /etc/security/limits.conf or systemd LimitNOFILE are capped by nr_open. A common mistake is raising only ulimit while leaving file-max or nr_open at defaults.
Common Pitfalls
- ✗ Raising the Nginx or HAProxy backlog directive without also raising net.core.somaxconn. The kernel clamps the listen backlog to somaxconn, so setting backlog=65535 in Nginx while somaxconn=128 means the actual backlog is 128. Both must be raised together.
- ✗ Setting SO_RCVBUF or SO_SNDBUF explicitly in application code and wondering why tcp_rmem/tcp_wmem changes have no effect. Explicit setsockopt calls disable the kernel auto-tuning for that socket. Remove the explicit setsockopt and let the kernel auto-tune within the tcp_rmem/tcp_wmem range instead.
- ✗ Writing sysctl values only to /proc/sys and forgetting to persist them in /etc/sysctl.d/. After a reboot or kernel update, all values revert to defaults. The production incident repeats, typically at the worst possible time.
- ✗ Setting vm.swappiness=0 on a host that needs swap as a safety net. swappiness=0 tells the kernel to avoid swapping almost entirely, which means the OOM killer activates sooner when memory pressure hits. On database hosts, swappiness=10 is usually the right balance: swap is available as a last resort but the kernel strongly prefers dropping page cache.
- ✗ Increasing tcp_rmem and tcp_wmem max values without considering total memory impact. With 100K connections and a 32 MB max buffer, the theoretical worst case is 3.2 TB of buffer memory. The kernel auto-tuning prevents this in practice, but applications that set large SO_RCVBUF values explicitly bypass auto-tuning and can exhaust memory.
- ✗ Tuning sysctls in the host namespace and expecting them to apply inside containers. Many network sysctls are per-network-namespace. Containers with their own network namespace inherit the defaults, not the host values. Use sysctl settings in the container runtime configuration (docker run --sysctl, Kubernetes securityContext.sysctls) instead.
Reference
In One Line
Default kernel settings cap listen backlogs at 128, limit socket buffers to 6 MB, and let dirty pages pile up -- sysctl adjusts these dials to match production workloads without recompiling anything.