NGINX
The web server that actually handles scale, plus reverse proxy and load balancer
Architecture
Why It Exists
Apache's process-per-connection model worked fine at a few thousand concurrent connections. Then the web got big, and it stopped working. Each Apache process ate 8-50 MB of RAM and a kernel thread, so 10,000 connections meant 80-500 GB of RAM just to hold connections open. That's the C10K problem in a nutshell.
Igor Sysoev wrote NGINX in 2004 specifically to kill this problem. The core idea is simple: instead of one thread per connection, use non-blocking I/O multiplexing (epoll/kqueue) so a single worker process can juggle 10,000+ connections with a few megabytes of memory. That architectural bet paid off. NGINX now powers over 30% of all web servers and is the default front door for most microservice architectures.
How It Works Internally
NGINX runs as a master process plus multiple worker processes. The master (running as root) reads the config, binds to ports 80/443, and forks off worker processes that run as an unprivileged user. Each worker is a single-threaded event loop using the OS's I/O multiplexing facility: epoll on Linux, kqueue on FreeBSD/macOS. One thread, thousands of file descriptors, no context-switch overhead.
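A minimal sketch of those process-model knobs in nginx.conf (the values are illustrative, not tuned recommendations):

```nginx
user  nginx;                     # workers run as this unprivileged user
worker_processes  auto;          # one single-threaded worker per CPU core

events {
    use epoll;                   # Linux I/O multiplexing (kqueue is used on BSD/macOS)
    worker_connections  4096;    # max simultaneous connections per worker
    multi_accept  on;            # drain the accept queue on each wakeup
}
```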
When a connection arrives, the kernel hands it to a worker: by default all workers accept from the shared listening socket (serialized with accept_mutex on older versions), and with the reuseport listen parameter, SO_REUSEPORT gives each worker its own socket and lets the kernel spread connections across them. The worker accepts the connection, adds the socket's file descriptor to its epoll interest set, and from here everything is non-blocking. The worker reads request headers (possibly across multiple epoll events if data trickles in), runs the request through the handler chain (location matching, rewrite rules, header manipulation), and either serves a static file or opens a connection to an upstream server.
For reverse proxying, the worker opens a non-blocking connection to the upstream, writes the request, and adds that upstream socket to its epoll set too. When the upstream responds, the worker reads the response (buffering in memory or spilling to temp files depending on the proxy_buffering settings) and writes it back to the client. The key point: while waiting for I/O on any connection, the worker is handling other connections. There is zero idle waiting. Ever.
The request processing pipeline uses a multi-phase architecture. When a request comes in, NGINX walks it through phases: post-read, server-rewrite, find-config (location matching), rewrite, post-rewrite, pre-access, access (authentication), post-access, try-files, content (the handler that actually generates the response), and log. Each location block can attach handlers at specific phases. This design is why modules compose so cleanly. A rate-limiting module runs at the pre-access phase, an auth module at the access phase, and the proxy module at the content phase. They don't know about each other. They don't need to.
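A sketch of how independent modules attach at different phases within one location (the zone name, upstream name, and password file are illustrative assumptions; the zone declaration and upstream belong at the http{} level):

```nginx
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;   # http{} context

server {
    location /api/ {
        limit_req            zone=per_ip burst=20;   # pre-access phase
        auth_basic           "internal";             # access phase
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://backend_pool;    # content phase handler
        access_log           /var/log/nginx/api.log; # log phase
    }
}
```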
Shared memory zones handle cross-worker state. The limit_req_zone directive allocates a shared memory segment that all workers use for rate-limit counters. proxy_cache_path creates a shared cache any worker can read or write. ssl_session_cache shared stores TLS session tickets so a client resuming a TLS session can land on any worker. These zones use lock-free or minimally-locked data structures (red-black trees, slab allocators) to keep contention low.
Production Architecture
In production, NGINX sits at the edge. It terminates TLS, distributes traffic, and absorbs the first layer of abuse. The standard HA setup: 2+ NGINX instances behind a cloud load balancer (AWS ALB/NLB, GCP LB) or a VRRP pair (keepalived) for on-prem, with each instance running worker_processes auto (one worker per CPU core).
For TLS termination, NGINX takes on the expensive handshake at the edge so backends talk plain HTTP over a private network. TLS 1.3 cuts the full handshake from 2 round trips (TLS 1.2) to 1, and ECDSA P-256 certificates make the signing operation far cheaper than RSA. OCSP stapling (ssl_stapling on) saves the client from contacting the CA for revocation checking, which shaves 100-300ms off the first connection. Session tickets or a shared SSL session cache enable fast resumption for returning clients, and TLS 1.3 adds optional 0-RTT early data on top of that.
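A hedged sketch of an edge TLS server block along those lines (hostnames, paths, and the resolver address are placeholders):

```nginx
server {
    listen 443 ssl;
    http2  on;                        # on nginx < 1.25.1 use "listen 443 ssl http2;" instead
    server_name example.com;

    ssl_certificate     /etc/nginx/tls/example.com.fullchain.pem;  # leaf + intermediates
    ssl_certificate_key /etc/nginx/tls/example.com.key;
    ssl_protocols       TLSv1.2 TLSv1.3;

    ssl_session_cache   shared:SSL:10m;   # resumption works on any worker
    ssl_session_tickets on;

    ssl_stapling        on;               # staple the OCSP response into the handshake
    ssl_stapling_verify on;
    resolver            127.0.0.53;       # needed to reach the OCSP responder

    location / {
        proxy_pass http://app_pool;       # plain HTTP to the backends on the private network
    }
}
```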
The upstream block defines backend pools. A solid production config includes: keepalive 32 (persistent connections to backends), zone upstream_pool 64k (shared memory for health state), max_fails=3 fail_timeout=30s (passive health checking), and least_conn for variable-latency backends. For canary deployments, split_clients routes a percentage of traffic to a new version based on a hash of the client IP or a request header.
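A sketch of that upstream pool plus a split_clients canary (addresses, pool names, and the 5% split are made-up values):

```nginx
upstream app_pool {
    zone app_pool 64k;                # shared memory for health/peer state across workers
    least_conn;
    server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;                     # persistent connections to the backends
}

upstream app_pool_canary {
    zone app_pool_canary 64k;
    server 10.0.1.20:8080 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

# route ~5% of clients, hashed by IP, to the canary pool
split_clients "$remote_addr" $app_upstream {
    5%  app_pool_canary;
    *   app_pool;
}

server {
    location / {
        # upstream keepalive also needs proxy_http_version 1.1 and an empty
        # Connection header -- see Failure Scenario 1 below for the full wiring
        proxy_pass http://$app_upstream;
    }
}
```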
Caching is probably NGINX's most underrated feature for cutting backend load. proxy_cache_path creates a disk-backed cache with a shared memory index. Set proxy_cache_valid 200 10m to cache successful responses for 10 minutes. For API responses, proxy_cache_key "$request_uri$arg_page" makes sure pagination is cached separately. A properly configured cache can drop backend traffic by 80-90% for read-heavy workloads. Not using it leaves performance on the table.
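An illustrative wiring of that cache (zone name, sizes, and TTLs are assumptions to adapt):

```nginx
# http{} level: disk-backed store plus a shared-memory key index
proxy_cache_path /var/cache/nginx/api keys_zone=api_cache:128m
                 max_size=10g inactive=30m use_temp_path=off;

server {
    location /api/ {
        proxy_cache       api_cache;
        proxy_cache_key   "$request_uri$arg_page";  # paginated responses cached separately
        proxy_cache_valid 200 10m;                  # cache successful responses for 10 minutes
        proxy_cache_valid 404 1m;
        add_header X-Cache-Status $upstream_cache_status;  # expose HIT/MISS/EXPIRED for monitoring
        proxy_pass http://app_pool;
    }
}
```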
Rate limiting uses the leaky bucket algorithm via limit_req_zone. A typical setup: limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s caps each IP at 10 requests per second with a burst allowance. The shared memory zone (10m) stores roughly 160,000 IP addresses. For API gateways, limiting by API key header works well: limit_req_zone $http_x_api_key zone=apikey:10m rate=100r/s.
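The two zones from above plus the apply side with a burst allowance (zone names and limits follow the example values in the text):

```nginx
limit_req_zone $binary_remote_addr zone=api:10m    rate=10r/s;   # ~160k IP states in 10 MB
limit_req_zone $http_x_api_key     zone=apikey:10m rate=100r/s;  # keyed on the API-key header

server {
    location /api/ {
        limit_req zone=api    burst=20 nodelay;  # absorb short spikes without queueing delay
        limit_req zone=apikey burst=200;         # both limits apply; the stricter one rejects first
        limit_req_status 429;                    # default rejection status is 503
        proxy_pass http://app_pool;
    }
}
```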
Decision Criteria
| Criteria | NGINX | HAProxy | Envoy | Caddy |
|---|---|---|---|---|
| Architecture | Event-driven, multi-process | Event-driven, multi-threaded | Event-driven, multi-threaded | Event-driven, goroutines |
| Primary strength | Web serving + reverse proxy | Pure TCP/HTTP load balancing | Service mesh, dynamic config | Auto-TLS, simplicity |
| Configuration | Static files, reload for changes | Static files, reload or runtime API | Dynamic via xDS API (control plane) | Caddyfile or JSON API |
| TLS handling | Excellent (OCSP, session tickets) | Good | Excellent (SDS for cert rotation) | Automatic (Let's Encrypt built-in) |
| HTTP/2 support | Client-side; upstream only for gRPC (grpc_pass) | Client + backend (since 2.0) | Full (client + upstream, gRPC native) | Full |
| Load balancing | Round-robin, least_conn, ip_hash, random | Round-robin, leastconn, source, URI | Round-robin, least_request, ring hash, Maglev | Round-robin, random, least_conn |
| Health checks | Passive (OSS), Active (Plus) | Active + passive, agent checks | Active + passive, outlier detection | Active + passive |
| Dynamic config | Reload (graceful but file-based) | Reload or runtime API | Fully dynamic via xDS | Runtime JSON API |
| Observability | Access logs, stub_status, Plus dashboard | Detailed stats page, Prometheus | Native Prometheus, distributed tracing | Prometheus, structured logs |
| Throughput (HTTP) | ~500K-1M req/sec | ~500K-1M req/sec | ~200K-500K req/sec | ~200K-400K req/sec |
Capacity Planning
Worker processes: Set worker_processes auto to match CPU cores. Each worker uses ~2.5-10 MB of RAM base. With worker_connections 4096, each worker handles up to 2,048 simultaneous proxy connections (each proxy uses 2 FDs: client + upstream). A 16-core server means 16 workers * 2,048 connections = roughly 32,000 concurrent proxy connections.
Memory: Base memory is small, around 50-100 MB. The real consumers are proxy buffers (proxy_buffer_size * proxy_buffers * concurrent_connections), response cache (configured size + 10% overhead), rate-limit zones, and SSL session cache. For a proxy handling 10,000 concurrent connections with an 8KB buffer each, that's ~80 MB just for buffers. A 10 GB disk cache with a 128 MB shared memory index handles around 1 million cached objects (roughly 8,000 keys per MB of keys_zone).
TLS throughput: ECDSA P-256 handshakes run at about 25,000/sec per core. RSA-2048 is much heavier at ~3,000/sec per core. TLS 1.3 0-RTT resumption hits ~50,000 resumptions/sec per core. For 10,000 new TLS connections/sec, budget 4-8 cores for handshake processing alone.
Upstream connections: Without keepalive, each proxy request opens a new TCP connection (3-way handshake + possible TLS). At 10,000 req/sec, that's 10,000 connections/sec created and destroyed. The problem: ephemeral ports. The default range provides 28,232 ports with a 2-minute TIME_WAIT, which caps throughput at ~235 connections/sec. The fix is keepalive 64 in the upstream block, which maintains 64 idle connections per worker and nearly eliminates connection setup overhead for sustained traffic.
File descriptors: Each worker needs worker_connections * 2 file descriptors (client + upstream) plus handles for logs, cache, and static files. Set worker_rlimit_nofile 65535 and the OS-level ulimit -n 65535. The system-wide limit (/proc/sys/fs/file-max) has to accommodate all workers.
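Expressed in config, that ceiling is one directive (the value is an example; pair it with ulimit -n or a systemd LimitNOFILE= setting at the OS level):

```nginx
worker_rlimit_nofile 65535;   # per-worker FD ceiling: >= worker_connections * 2 plus logs/cache/static files
```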
Disk I/O: For cache-heavy deployments, the cache disk needs to handle the write throughput of cache misses plus read throughput of cache hits. A 10,000 req/sec workload with 80% cache hit rate means 2,000 cache miss writes/sec + 8,000 cache reads/sec. Use SSDs for cache storage. Spinning disks will become the bottleneck fast.
Failure Scenarios
Scenario 1: Upstream Connection Exhaustion (Ephemeral Port Starvation)
Trigger: NGINX is proxying 20,000 requests/sec to a backend pool without keepalive in the upstream block. Each request opens a new TCP connection, uses it once, then closes it. Closed connections enter TIME_WAIT for 60-120 seconds (depends on the OS). The ephemeral port range (32768-60999 on Linux = 28,232 ports) gets exhausted because ports are consumed faster than TIME_WAIT releases them.
Impact: NGINX starts returning 502 Bad Gateway for any request that can't establish an upstream connection. The error log fills with connect() failed (99: Cannot assign requested address). From the client's perspective, the service is down. The backends are perfectly healthy, just unreachable. It partially self-recovers as TIME_WAIT connections expire, but under sustained load the system stays in a degraded state.
Detection: Monitor the 502 error rate in access logs. Alert on NGINX error log entries containing "Cannot assign requested address." Track TIME_WAIT connection count via ss -s or netstat, and alert when TIME_WAIT exceeds 20,000. Watch upstream response time for gradual increases, which are an early warning of port pressure.
Recovery: Add keepalive 64 (or higher) to each upstream block and set proxy_http_version 1.1 + proxy_set_header Connection "" to enable HTTP/1.1 keepalive to backends. This eliminates per-request connection teardown. Also widen the ephemeral port range: sysctl -w net.ipv4.ip_local_port_range="1024 65535" and enable TIME_WAIT reuse: sysctl -w net.ipv4.tcp_tw_reuse=1. For multi-backend pools, make sure connections distribute evenly to avoid exhausting ports against a single backend IP.
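A minimal sketch of the keepalive fix from the recovery steps (pool name, addresses, and size are placeholders):

```nginx
upstream app_pool {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    keepalive 64;                         # idle connections kept open per worker
}

server {
    location / {
        proxy_http_version 1.1;           # upstream keepalive requires HTTP/1.1
        proxy_set_header   Connection ""; # drop the "Connection: close" nginx sends by default
        proxy_pass         http://app_pool;
    }
}
```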
Scenario 2: Configuration Reload with SSL Certificate Error
Trigger: An automated deployment pipeline updates the NGINX config to add a new virtual host with an SSL certificate. The certificate file path is wrong, or the chain is incomplete (missing intermediate CA). Someone runs nginx -s reload.
Impact: The master process tries to load the new config. SSL certificates are loaded during config parsing, before new workers spawn. If the certificate file is missing, unparseable, or doesn't match its key, the master rejects the entire new config and keeps the old workers running with the old config. That's the graceful failure mode, and it works well. But here's the catch: a certificate with an incomplete chain (missing intermediate CA) loads cleanly, and certificates referenced through variables are only loaded at handshake time, so the broken config is accepted. New workers serving that virtual host then fail TLS handshakes or client-side certificate validation. Existing connections on old workers keep working fine, but new connections to the affected virtual host break. If it's the primary virtual host, all new traffic hits TLS errors.
Detection: Always run nginx -t before nginx -s reload. It performs a dry-run config check that verifies each certificate loads and matches its key, but it cannot tell whether the chain is complete from a client's point of view, so verify the served chain externally (for example with openssl s_client). Monitor TLS handshake failures via NGINX Plus metrics or access log analysis (look for connection resets during the handshake). Alert on sudden spikes in client-side TLS errors.
Recovery: Immediately reload with the previous known-good config. Build a deployment pipeline that runs nginx -t and rolls back on failure before ever issuing nginx -s reload. Store certificate hashes and validate chain completeness as a pre-deployment check. For zero-risk certificate rotation, use NGINX Plus's key-value store for dynamic SSL certificate loading, or use cert-manager with Kubernetes Ingress, which validates before deploying.
Scenario 3: Cache Thundering Herd on Key Expiration
Trigger: A popular cached resource (say, the homepage API response cached with proxy_cache_valid 200 5m) expires. Thousands of concurrent requests hit at the same instant. All of them see a cache miss. All of them forward to the backend simultaneously.
Impact: The backend, sized for 500 req/sec (because the cache normally absorbs 95% of traffic), suddenly gets hit with 5,000 identical requests. Its connection pool is exhausted, response latency spikes from 50ms to 5 seconds, and some requests time out. NGINX returns 504 Gateway Timeout for the ones that don't make it. The first successful backend response repopulates the cache, but by then the damage is done. The backend is buried under a backlog, and the cascading latency bleeds into other endpoints sharing the same backend pool.
Detection: Monitor cache hit ratio (proxy_cache_status header analysis). A sudden drop from 95% to 0% is the signal. Track backend request rate and alert on spikes exceeding 3x baseline. Watch upstream response time percentiles, specifically p99 spikes that indicate overload.
Recovery: Enable proxy_cache_lock on. This makes concurrent requests for the same cache key wait while the first request populates the cache, then serves the cached response to everyone waiting. Set proxy_cache_lock_timeout 5s as a safety valve. Add proxy_cache_use_stale updating error timeout to serve stale (expired) content while a refresh is in flight, so the miss storm never reaches the client. For critical resources, pair it with proxy_cache_background_update on so the stale entry is returned immediately and the refresh runs in a background subrequest.
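The recovery directives assembled into one illustrative location block (cache zone, path, and upstream name are assumptions):

```nginx
location /api/homepage {
    proxy_cache              api_cache;
    proxy_cache_valid        200 5m;

    proxy_cache_lock         on;      # only one request per key goes to the backend on a miss
    proxy_cache_lock_timeout 5s;      # waiters stop waiting and pass through after this long

    # serve expired entries while a refresh is in flight or the backend is erroring/timing out
    proxy_cache_use_stale         updating error timeout;
    proxy_cache_background_update on;  # the refresh runs as a background subrequest

    proxy_pass http://app_pool;
}
```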
Pros
- • Handles massive concurrency through event-driven, non-blocking I/O
- • Low memory footprint
- • Battle-tested at serious scale
- • Rich module ecosystem
- • Supports HTTP, TCP, and UDP load balancing
Cons
- • Configuration gets gnarly fast for advanced use cases
- • Dynamic reconfiguration requires a reload
- • Limited built-in API management features
- • Free version is missing several enterprise features
- • Lua scripting for advanced logic adds real complexity
When to use
- • You need a reverse proxy in front of application servers
- • SSL termination and HTTP/2 support
- • Serving static files alongside dynamic content
- • Simple load balancing without a service mesh
When NOT to use
- • You need a full API gateway with auth, rate limiting, analytics
- • Service mesh with dynamic service discovery
- • Complex traffic routing that needs programmatic control
- • GraphQL-specific gateway features
Key Points
- •NGINX uses an event-driven, non-blocking architecture. Each worker process handles thousands of connections in a single thread via epoll (Linux) or kqueue (BSD). No thread-per-connection overhead.
- •A single NGINX instance can handle 100,000+ concurrent connections; 10,000 inactive keep-alive connections occupy roughly 2.5 MB of memory. Compare that to Apache's ~8 MB per thread/process.
- •Configuration reload (nginx -s reload) is graceful: the master process spawns new workers with the new config while old workers drain existing connections. Zero dropped connections.
- •Shared memory zones let workers share state for rate limiting, caching, and SSL session resumption without inter-process locking overhead.
- •Upstream health checks (passive by default, active in NGINX Plus) matter a lot. Without them, NGINX keeps routing to dead backends until the connect timeout expires.
Common Mistakes
- ✗Setting worker_connections too low (default 512) for high-traffic deployments. Each proxy connection uses 2 file descriptors (client + upstream), so the effective connections per worker is worker_connections / 2.
- ✗Not tuning proxy_buffer_size and proxy_buffers for backends that return large headers. The result is 502 errors when response headers exceed the default 4K/8K buffer.
- ✗Using proxy_pass without a trailing slash inconsistently. /api/ vs /api changes URI rewriting behavior and causes subtle routing bugs that are painful to debug.
- ✗Forgetting to set proxy_set_header Host, X-Real-IP, and X-Forwarded-For. Backends receive NGINX's IP instead of the client's, which breaks logging, rate limiting, and geo-routing (see the sketch after this list).
- ✗Not configuring keepalive connections to upstreams (keepalive directive in upstream block). Without this, every proxy request opens a new TCP connection, adding 1-3ms latency and exhausting ephemeral ports under load.
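A hedged sketch of the header-forwarding boilerplate from the fourth item above (the upstream name is a placeholder):

```nginx
location / {
    proxy_set_header Host              $host;                 # original Host, not the upstream's name
    proxy_set_header X-Real-IP         $remote_addr;          # actual client address
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;  # append to any existing XFF chain
    proxy_set_header X-Forwarded-Proto $scheme;               # tell the backend TLS ended at the edge
    proxy_pass http://app_pool;
}
```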