API Gateway vs Load Balancer vs Reverse Proxy — Modern Patterns
Difficulty: Intermediate
Key Points for API Gateway vs Load Balancer vs Reverse Proxy
- A load balancer distributes traffic, a reverse proxy mediates it, and an API gateway manages it — they overlap significantly but solve different primary problems.
- Most production architectures use all three, often in a single product: NGINX can be a reverse proxy and load balancer, Kong adds API gateway features on top.
- L4 load balancers (TCP level) are faster but blind to HTTP — they cannot route by URL path, add headers, or do content-based routing.
- API gateways add business logic to the network edge: auth token validation, API key management, request/response transformation, and usage analytics.
- The modern trend is convergence — Envoy, Kong, and cloud ALBs blur the boundaries by offering reverse proxy, load balancing, and gateway features in one product.
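The L4-vs-L7 distinction above can be sketched in a few lines — a hypothetical content-based router making the kind of decision only an L7 proxy can see (upstream names and paths are illustrative, not from any real gateway config):

```python
# Sketch of L7 (content-based) routing: pick an upstream by URL path and
# inject edge headers. An L4 balancer never sees the path or headers.
UPSTREAMS = {
    "orders":  ["orders-1:8080", "orders-2:8080"],
    "users":   ["users-1:8080"],
    "default": ["web-1:8080"],
}

def route(path: str, headers: dict) -> tuple[str, dict]:
    """Choose an upstream pool by path prefix and add forwarding headers."""
    if path.startswith("/api/orders"):
        pool = "orders"
    elif path.startswith("/api/users"):
        pool = "users"
    else:
        pool = "default"
    # An L7 proxy can rewrite or add headers before forwarding.
    fwd = dict(headers)
    fwd["X-Forwarded-Proto"] = "https"
    return UPSTREAMS[pool][0], fwd
```

A real gateway would also load-balance within the pool and attach auth/rate-limit checks at this same decision point.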
Common Mistakes with API Gateway vs Load Balancer vs Reverse Proxy
- Putting business logic in the API gateway. Rate limiting and auth belong there; order validation and pricing rules belong in the services.
- Using an L7 API gateway for TCP/UDP traffic that does not need HTTP-level features — the system pays the parsing overhead for no benefit.
- Not understanding the difference between L4 and L7 load balancing. L4 is faster but cannot route by path, host header, or cookie.
- Running multiple layers of TLS termination unnecessarily. If the ALB terminates TLS, the API gateway does not need to terminate it again (unless re-encryption is required).
- Treating the API gateway as a single point of failure. If it goes down, every API goes down. Always deploy gateways in HA pairs with health checks.
Tools for API Gateway vs Load Balancer vs Reverse Proxy
- Kong (Open Source): Full-featured API gateway built on NGINX with plugin ecosystem for auth, rate limiting, and transformations — Scale: Medium-Enterprise
- AWS API Gateway (Managed): Serverless API management with Lambda integration, usage plans, and API keys — zero infrastructure to manage — Scale: Small-Enterprise
- NGINX (Open Source): Industry-standard reverse proxy and load balancer with proven performance — the foundation most other tools build on — Scale: Small-Enterprise
- Envoy (Open Source): Modern L4/L7 proxy with advanced load balancing, observability, and dynamic configuration via xDS — the cloud-native standard — Scale: Medium-Enterprise
Related to API Gateway vs Load Balancer vs Reverse Proxy
HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, gRPC & Protocol Buffers, TLS Handshake — Step by Step, Service Mesh Networking, CDN & Edge Networking, DDoS & Rate Limiting
ARP & MAC Addresses — Foundations & Data Travel
Difficulty: Intermediate
Key Points for ARP & MAC Addresses
- ARP operates at Layer 2 and bridges the gap between IP addresses (Layer 3) and MAC addresses (Layer 2).
- ARP requests are broadcast to every device on the LAN segment. In large flat networks, ARP traffic can become a serious problem.
- ARP cache entries expire and must be refreshed — timeouts vary widely by platform, from under a minute on Linux to hours on some switch and router defaults.
- ARP has zero built-in authentication. Any device can claim any IP-to-MAC mapping — this is the basis of ARP spoofing attacks.
- In cloud environments, ARP is handled differently — AWS uses proxy ARP, and most CNI plugins manage ARP for container networking.
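The broadcast request described above has a fixed 28-byte payload; a sketch that packs one with Python's stdlib (the MAC and IP values are illustrative):

```python
import struct

def arp_request(sender_mac: bytes, sender_ip: bytes, target_ip: bytes) -> bytes:
    """Build the 28-byte ARP request payload (who-has target_ip)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,             # hardware type: Ethernet
        0x0800,        # protocol type: IPv4
        6, 4,          # hardware / protocol address lengths
        1,             # opcode 1 = request
        sender_mac, sender_ip,
        b"\x00" * 6,   # target MAC all zeros — that is the question being asked
        target_ip,
    )

pkt = arp_request(b"\xaa\xbb\xcc\xdd\xee\xff",
                  b"\xc0\xa8\x01\x0a",   # 192.168.1.10
                  b"\xc0\xa8\x01\x01")   # 192.168.1.1
```

The reply simply flips the opcode to 2 and fills in the target MAC — and nothing verifies that the answer is honest, which is why ARP spoofing works.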
Common Mistakes with ARP & MAC Addresses
- Ignoring ARP in troubleshooting. When ping fails to a host on the same subnet, the problem is often ARP, not routing.
- Allowing flat Layer 2 networks to grow too large. Thousands of hosts on one broadcast domain means ARP storms.
- Not using Dynamic ARP Inspection (DAI) on managed switches, leaving the network vulnerable to ARP spoofing.
- Assuming MAC addresses are always unique. Virtual machines, containers, and cloned images can have duplicate MACs.
- Forgetting that ARP only works within a broadcast domain. Across subnets, the router handles the MAC resolution on each segment.
Tools for ARP & MAC Addresses
- arpwatch (Open Source): Monitoring ARP activity and detecting new or changed MAC-to-IP mappings on a LAN — Scale: Small-Enterprise
- Wireshark (Open Source): Capturing and analyzing ARP packets with full decode and filtering — Scale: Small-Enterprise
- Dynamic ARP Inspection (DAI) (Commercial): Switch-level ARP validation using DHCP snooping database to prevent spoofing — Scale: Medium-Enterprise
- arping (Open Source): Sending ARP requests from the command line to test Layer 2 reachability — Scale: Small-Enterprise
Related to ARP & MAC Addresses
OSI Model — The Real Version, IP Addressing & Subnetting, DHCP Protocol, Life of a Packet, TCP/IP Debugging Toolkit, Zero Trust Networking
CDN & Edge Networking — Performance & Observability
Difficulty: Intermediate
Key Points for CDN & Edge Networking
- CDNs reduce latency by serving content from the nearest PoP — a cache hit at the edge returns in 5-20ms vs 200-500ms from origin
- Cache-Control headers are the contract between the origin and the CDN — misconfigured headers are the #1 cause of caching problems
- The shield/mid-tier cache prevents the thundering herd problem: 300 edge PoPs missing cache simultaneously would send 300 requests to origin
- Anycast means a single IP address resolves to the nearest edge server — no DNS-based geo-routing needed, and failover is automatic via BGP
- Edge compute is not just caching — it handles auth checks, A/B tests, header manipulation, and even full API logic at the edge
Common Mistakes with CDN & Edge Networking
- Setting Cache-Control: no-cache on content that could be cached. Many teams cache-bust everything out of fear, negating the entire CDN benefit
- Not understanding the Vary header. Vary: Accept-Encoding is fine, but Vary: Cookie makes every user get a unique cache entry — effectively disabling caching
- Ignoring cache key design. Including session tokens or random query params in cache keys causes 0% hit rate on content that should be cacheable
- Purging cache globally when only specific URLs changed. Use targeted purge by URL or surrogate key, not nuclear purge-all
- Assuming CDN handles dynamic content automatically. Dynamic API responses need explicit caching rules (stale-while-revalidate, short TTLs) or edge compute
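The cache-key mistake above can be avoided with explicit normalization — a sketch assuming a hypothetical list of tracking parameters to strip (which params are noise is site-specific):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Illustrative allowlist-by-exclusion; real CDNs configure this per route.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "session_id", "fbclid"}

def cache_key(url: str) -> str:
    """Normalize a URL into a cache key: drop tracking params, sort the rest
    so that ?a=1&b=2 and ?b=2&a=1 hit the same cache entry."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in IGNORED_PARAMS)
    query = urlencode(kept)
    return f"{parts.path}?{query}" if query else parts.path
```

Without this, every unique `utm_source` value mints a fresh cache entry for the same bytes, driving the hit rate toward zero.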
Tools for CDN & Edge Networking
- Cloudflare (Managed): Developer experience, Workers edge compute, DDoS protection, free tier — Scale: Small-Enterprise
- AWS CloudFront (Managed): Deep AWS integration, Lambda@Edge, S3 origin support — Scale: Medium-Enterprise
- Fastly (Managed): Instant purge (<150ms global), VCL programmability, real-time logging — Scale: Medium-Enterprise
- Akamai (Managed): Largest network (4,000+ PoPs), enterprise security, media delivery — Scale: Enterprise
Related to CDN & Edge Networking
Network Latency — Where Time Goes, HTTP/2 — Multiplexing Revolution, HTTP/3 — UDP Takes Over, TLS Handshake — Step by Step, DNS Protocol Deep Dive, DDoS & Rate Limiting, API Gateway vs Load Balancer vs Reverse Proxy
Certificates & PKI — Security & Encryption
Difficulty: Intermediate
Key Points for Certificates & PKI
- Browsers and operating systems ship with ~150 root CA certificates that form the foundation of internet trust.
- Intermediate CAs exist so root keys can stay offline in HSMs — if an intermediate is compromised, only it is revoked, not the root.
- Let's Encrypt issues over 3 million certificates per day using the automated ACME protocol, making HTTPS free and ubiquitous.
- Certificate Transparency (CT) logs have caught multiple CA misissuance incidents, including the Symantec distrust event.
- Certificate pinning (HPKP) was deprecated by Chrome because it caused more outages than it prevented — use CT logs instead.
Common Mistakes with Certificates & PKI
- Forgetting to include intermediate certificates in the server config, causing failures in non-browser clients like curl and mobile apps.
- Letting certificates expire in production because nobody set up automated renewal. This causes full outages with no graceful degradation.
- Using self-signed certificates in production without proper trust distribution — every client must explicitly trust the CA.
- Generating RSA keys smaller than 2048 bits. Anything below this is considered insecure and rejected by modern browsers.
- Storing private keys in plaintext on disk or in version control. Use HSMs, Vault, or at minimum encrypted file systems.
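Automated renewal starts with knowing how close expiry is — a minimal sketch, assuming ISO-8601 notAfter strings and a 30-day renewal threshold (an operational convention, not a standard):

```python
from datetime import datetime, timezone

RENEW_BEFORE_DAYS = 30  # e.g. renew 90-day Let's Encrypt certs at day ~60

def days_left(not_after: str, now: datetime) -> int:
    """Days until the certificate's notAfter timestamp (UTC assumed)."""
    expiry = datetime.fromisoformat(not_after).replace(tzinfo=timezone.utc)
    return (expiry - now).days

def needs_renewal(not_after: str, now: datetime) -> bool:
    return days_left(not_after, now) < RENEW_BEFORE_DAYS
```

In practice tools like certbot or cert-manager run this check on a schedule; the point is that the check exists and alerts before, not after, expiry.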
Tools for Certificates & PKI
- Let's Encrypt (Open Source): Free, automated DV certificates for public-facing domains — Scale: Small-Enterprise
- DigiCert (Commercial): EV and OV certificates with SLA-backed issuance and support — Scale: Enterprise
- AWS ACM (Managed): Auto-provisioned and auto-renewed certificates for AWS resources (ALB, CloudFront, API Gateway) — Scale: Enterprise
- cert-manager (Open Source): Automated certificate lifecycle management in Kubernetes clusters — Scale: Small-Enterprise
Related to Certificates & PKI
TLS Handshake — Step by Step, mTLS — Mutual Authentication, OAuth 2.0 & OIDC Flows, Zero Trust Networking, Service Mesh Networking
Connection Pooling & Keep-Alive — Transport & Reliability
Difficulty: Intermediate
Key Points for Connection Pooling & Keep-Alive
- A new TCP connection costs 1 RTT (handshake) + 1-2 RTT (TLS) + slow start ramp-up. On a 100ms link, that's 200-300ms before data flows at full speed
- HTTP/1.1 Keep-Alive was the first step: reuse the TCP connection for sequential requests. HTTP/2 took it further with multiplexed concurrent requests on one connection
- Database connection pools (HikariCP, pgbouncer) are critical because database handshakes are even more expensive than HTTP — PostgreSQL's fork-per-connection model makes this essential
- Pool sizing is a balance: too few connections causes queuing, too many exhausts server resources. Little's Law (L = lambda * W) is the guide
- Connection health checking prevents borrowing dead connections. A stale connection that fails on first use is worse than creating a new one
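Little's Law from the sizing point above, as arithmetic — the 20% headroom factor is an assumed safety margin for bursts, not part of the law:

```python
import math

def pool_size(arrival_rate_per_s: float, avg_hold_time_s: float,
              headroom: float = 1.2) -> int:
    """Little's Law: L = lambda * W.
    arrival_rate_per_s -- queries/second hitting the pool (lambda)
    avg_hold_time_s    -- how long each query holds a connection (W)
    headroom           -- burst safety multiplier (assumed, tune per workload)
    """
    return math.ceil(arrival_rate_per_s * avg_hold_time_s * headroom)

# 1,000 queries/s, each holding a connection for 5 ms:
# steady state L = 1000 * 0.005 = 5 connections; ~6 with 20% headroom.
```

This is why 20-30 connections can serve hundreds of threads: connections are held only for the few milliseconds a query is actually in flight.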
Common Mistakes with Connection Pooling & Keep-Alive
- Setting max pool size equal to max threads. With 200 threads and 200 DB connections, connections are held during CPU work. 20-30 connections typically serve 200 threads
- Not configuring idle timeout. Connections sitting idle for minutes get killed by firewalls, NATs, or load balancers — then the app gets a surprise 'connection reset'
- Leaking connections by not returning them to the pool in error paths. Always use try-finally (or equivalent) to guarantee connection return
- Ignoring connection validation. A TCP connection can be half-closed without either side knowing. Validate before use with a lightweight query (SELECT 1)
- Using default pool settings in production. Every database, cloud provider, and load balancer has different timeout defaults — they must be aligned
Tools for Connection Pooling & Keep-Alive
- HikariCP (Open Source): JVM database connection pooling — fastest, most battle-tested pool for Java/Kotlin — Scale: Any
- PgBouncer (Open Source): PostgreSQL connection pooling proxy — essential for serverless and high-connection-count environments — Scale: Medium-Enterprise
- ProxySQL (Open Source): MySQL/MariaDB connection pooling with query routing, caching, and read/write splitting — Scale: Medium-Enterprise
- Envoy Proxy (Open Source): HTTP/gRPC connection pooling for service mesh with circuit breaking and outlier detection — Scale: Large-Enterprise
Related to Connection Pooling & Keep-Alive
TCP Deep Dive, TLS Handshake — Step by Step, HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, gRPC & Protocol Buffers, Network Latency — Where Time Goes, Service Mesh Networking
Container Networking & Namespaces — Modern Patterns
Difficulty: Advanced
Key Points for Container Networking & Namespaces
- Every container gets its own network namespace with a dedicated network stack — isolated interfaces, routes, and iptables rules that cannot see the host's or other containers' stacks.
- veth pairs are the plumbing: one end sits inside the container namespace (eth0), the other connects to a bridge or directly to the host routing table. Deleting either end destroys both.
- VXLAN encapsulates L2 frames inside UDP packets (port 4789) to stretch a flat L2 network across L3 boundaries — the overlay tax is roughly 50 bytes of header per packet.
- Kubernetes requires that every pod gets a routable IP and that pods communicate without NAT. The CNI plugin enforces this contract regardless of the underlying network topology.
- kube-proxy in IPVS mode uses hash tables for O(1) service routing, supporting 10,000+ services without the linear chain-walk penalty of iptables mode.
Common Mistakes with Container Networking & Namespaces
- Assuming containers on different hosts can communicate without an overlay or direct routing setup. Without VXLAN, IP-in-IP, or BGP-advertised routes, cross-host pod traffic is black-holed.
- Running Docker's default bridge mode in production Kubernetes. The docker0 bridge uses NAT and port mapping, violating Kubernetes' flat-network requirement.
- Ignoring MTU mismatches when using overlays. VXLAN adds 50 bytes of header — if the underlying network MTU is 1500, the container MTU must be 1450 or fragmentation kills throughput.
- Not setting resource limits on kube-proxy. In iptables mode with 5,000+ services, kube-proxy can consume significant CPU regenerating rules on every endpoint change.
- Debugging container networking from the host namespace. The container has a different routing table — always exec into the container or use nsenter to enter its network namespace.
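The MTU arithmetic behind the overlay pitfall above — the overhead figures are the commonly cited header sizes and can vary (e.g. IPv6 underlays add more):

```python
# Commonly cited encapsulation overheads, bytes per packet.
ENCAP_OVERHEAD = {
    "none":  0,
    "vxlan": 50,  # outer Ethernet 14 + IPv4 20 + UDP 8 + VXLAN 8
    "ipip":  20,  # one extra IPv4 header
}

def pod_mtu(underlay_mtu: int, encap: str) -> int:
    """MTU the container interface must use to avoid fragmentation."""
    return underlay_mtu - ENCAP_OVERHEAD[encap]
```

Hence the 1450 figure for VXLAN over a standard 1500-byte network; on jumbo-frame underlays (9000) the overlay tax is negligible.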
Tools for Container Networking & Namespaces
- Calico (Open Source): BGP-based pod networking with no overlay overhead, strong NetworkPolicy enforcement, and support for both iptables and eBPF data planes — Scale: Medium-Enterprise
- Flannel (Open Source): Simple VXLAN overlay networking with minimal configuration — good for small clusters where advanced policy is not needed — Scale: Small-Medium
- Cilium (Open Source): eBPF-native CNI that replaces kube-proxy, provides L7 network policy, and includes built-in observability via Hubble — Scale: Medium-Enterprise
- WeaveNet (Open Source): Mesh overlay with automatic encryption and multicast support, easy setup for development and smaller clusters — Scale: Small-Medium
Related to Container Networking & Namespaces
IP Addressing & Subnetting, Service Mesh Networking, eBPF for Networking, NAT — Network Address Translation, ARP & MAC Addresses
CORS — Cross-Origin Resource Sharing — Security & Encryption
Difficulty: Beginner
Key Points for CORS — Cross-Origin Resource Sharing
- CORS is enforced by the browser, not the server. The server only sends headers — the browser decides whether to allow the response.
- curl and Postman ignore CORS entirely because they are not browsers. If an API works in curl but not in the browser, it is almost certainly a CORS issue.
- Preflight requests (OPTIONS) only happen for 'non-simple' requests — those with custom headers, non-standard methods, or JSON content type.
- Access-Control-Allow-Origin: * cannot be used with credentials (cookies). The server must echo the specific origin.
- Preflight responses can be cached with Access-Control-Max-Age to avoid an OPTIONS request before every actual request.
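The 'simple request' rule above, sketched as a predicate (a simplification of the Fetch spec's full definition, which also constrains header values):

```python
# Per the Fetch spec's notion of "simple" (CORS-safelisted) requests.
SAFE_METHODS = {"GET", "HEAD", "POST"}
SAFE_CONTENT_TYPES = {
    "application/x-www-form-urlencoded", "multipart/form-data", "text/plain",
}

def needs_preflight(method: str, headers: dict) -> bool:
    """Would a browser send an OPTIONS preflight before this request?"""
    if method.upper() not in SAFE_METHODS:
        return True
    for name, value in headers.items():
        lname = name.lower()
        if lname == "content-type":
            # application/json is NOT safelisted — it triggers preflight.
            if value.split(";")[0].strip().lower() not in SAFE_CONTENT_TYPES:
                return True
        elif lname not in {"accept", "accept-language", "content-language"}:
            return True
    return False
```

This is why a plain form POST sails through while the same POST with `Content-Type: application/json` costs an extra OPTIONS round trip.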
Common Mistakes with CORS — Cross-Origin Resource Sharing
- Setting Access-Control-Allow-Origin: * while also setting Access-Control-Allow-Credentials: true — browsers reject this combination.
- Forgetting to handle the OPTIONS preflight request on the server, returning 404 or 405, which blocks the actual request.
- Caching preflight responses too aggressively (very long Max-Age) making it impossible to update CORS policy quickly.
- Reflecting the request's Origin header back as Access-Control-Allow-Origin without validating it against an allowlist — this is effectively no CORS protection.
- Only configuring CORS on the application server but not on CDN or reverse proxy layers that might strip or override headers.
Tools for CORS — Cross-Origin Resource Sharing
- Nginx (Open Source): Configuring CORS headers at the reverse proxy level for all backends uniformly — Scale: Small-Enterprise
- AWS API Gateway (Managed): Built-in CORS configuration with per-route control and automatic OPTIONS handling — Scale: Small-Enterprise
- Cloudflare Workers (Managed): Edge-level CORS header injection with programmable rules — Scale: Enterprise
- Express cors middleware (Open Source): Flexible CORS configuration in Node.js applications with origin allowlists — Scale: Small-Enterprise
Related to CORS — Cross-Origin Resource Sharing
HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, OAuth 2.0 & OIDC Flows, REST vs GraphQL vs gRPC, API Gateway vs Load Balancer vs Reverse Proxy, DNS Protocol Deep Dive
DDoS & Rate Limiting — Security & Encryption
Difficulty: Advanced
Key Points for DDoS & Rate Limiting
- DDoS attacks operate at different layers and require layer-specific defenses. A single firewall cannot protect against all types.
- Volumetric attacks are the largest (measured in Tbps) but the easiest to mitigate with Anycast and scrubbing centers.
- Application-layer attacks are the hardest to mitigate because each request looks legitimate — effective defense requires behavioral analysis.
- Rate limiting is not just for DDoS. It protects against accidental traffic spikes, misbehaving clients, and cost overruns.
- The token bucket algorithm is the most widely used rate limiter because it allows bursts while enforcing an average rate.
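The token bucket from the last point, in miniature — time is injected as a parameter so the refill behavior is deterministic; a production limiter would use a monotonic clock:

```python
class TokenBucket:
    """Allows bursts up to `capacity` while enforcing an average of
    `rate` requests/second over time."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The burst allowance is the key property: a client can spend its saved-up tokens instantly, but sustained traffic above `rate` gets rejected.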
Common Mistakes with DDoS & Rate Limiting
- Implementing rate limiting only at the application level, missing attacks that overwhelm the network or transport layer.
- Using a fixed window rate limiter that allows double the limit at window boundaries — use sliding window instead.
- Rate limiting by IP address only, which punishes users behind NAT/proxies sharing an IP and is bypassed by botnets with millions of IPs.
- Setting rate limits too high to be useful or too low and blocking legitimate traffic — test with production traffic patterns first.
- Not having a DDoS response runbook. When an attack hits, it is too late to figure out who to call and what buttons to push.
Tools for DDoS & Rate Limiting
- Cloudflare (Managed): Global Anycast network with L3-L7 DDoS protection, WAF, and bot management — Scale: Enterprise
- AWS Shield + WAF (Managed): AWS-native DDoS protection (Shield Standard free, Advanced with SLA) paired with WAF rules — Scale: Enterprise
- Akamai Prolexic (Commercial): Dedicated DDoS scrubbing with BGP rerouting for the largest volumetric attacks — Scale: Enterprise
- fail2ban (Open Source): Host-level intrusion prevention that bans IPs based on log patterns (SSH brute force, HTTP abuse) — Scale: Small
Related to DDoS & Rate Limiting
TCP Deep Dive, UDP — When Speed Beats Safety, DNS Protocol Deep Dive, CDN & Edge Networking, API Gateway vs Load Balancer vs Reverse Proxy, Network Observability, TCP Congestion Control
DHCP Protocol — Foundations & Data Travel
Difficulty: Beginner
Key Points for DHCP Protocol
- DHCP assigns IP address, subnet mask, default gateway, DNS servers, and lease duration in a single exchange.
- The DORA process (Discover → Offer → Request → Acknowledge) uses exactly 4 UDP packets on ports 67 (server) and 68 (client).
- DHCP Discover is a broadcast — the client has no IP yet, so it sends to 255.255.255.255 from 0.0.0.0.
- Lease renewal happens at 50% (T1) and 87.5% (T2) of the lease duration. If both fail, the client must start over.
- In cloud environments, DHCP is managed by the platform — AWS VPC DHCP option sets configure DNS and domain names for all instances.
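The T1/T2 timers above are simple fractions of the lease (the RFC 2131 defaults):

```python
def renewal_timers(lease_seconds: int) -> tuple[int, int]:
    """T1 (unicast renew to the original server) and T2 (broadcast rebind
    to any server), per RFC 2131 defaults."""
    t1 = lease_seconds // 2        # 50% of lease
    t2 = lease_seconds * 7 // 8    # 87.5% of lease
    return t1, t2

# A 24-hour lease renews at 12h; if that fails, it rebinds at 21h;
# if rebind also fails, the client restarts DORA from scratch.
```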
Common Mistakes with DHCP Protocol
- Running two DHCP servers on the same subnet without coordination. Both will hand out addresses, causing IP conflicts.
- Setting lease times too long. If a device disconnects, its IP is locked for the full lease duration, wasting addresses.
- Setting lease times too short. Frequent renewals generate unnecessary traffic and risk brief outages during renewal failure.
- Forgetting to configure DHCP relay when adding a new VLAN. Devices on that VLAN will never get an IP address.
- Not reserving static IPs outside the DHCP pool. Printers, servers, and network gear with static IPs can conflict with DHCP assignments.
Tools for DHCP Protocol
- ISC DHCP (dhcpd) (Open Source): Battle-tested DHCP server for Linux — the de facto standard for decades — Scale: Medium-Enterprise
- Kea DHCP (Open Source): Modern replacement for ISC DHCP with a REST API and database backends — Scale: Medium-Enterprise
- dnsmasq (Open Source): Lightweight combined DNS + DHCP server, perfect for small networks and lab environments — Scale: Small-Enterprise
- Windows DHCP Server (Commercial): Active Directory integrated DHCP with GUI management and failover clustering — Scale: Medium-Enterprise
Related to DHCP Protocol
IP Addressing & Subnetting, ARP & MAC Addresses, DNS Protocol Deep Dive, NAT — Network Address Translation, OSI Model — The Real Version, Life of a Packet
DNS Protocol Deep Dive — Application Protocols
Difficulty: Intermediate
Key Points for DNS Protocol Deep Dive
- DNS resolution involves up to 4 hops: browser cache → OS cache → recursive resolver → authoritative server (via root → TLD)
- TTL (Time to Live) controls how long each cache layer holds a record — too short wastes bandwidth, too long delays changes
- DNS uses UDP by default for speed (single packet query/response) but falls back to TCP for responses over 512 bytes — the classic limit; EDNS0 raises it, commonly to 1232-4096 bytes
- DNS-over-HTTPS (DoH) encrypts DNS queries inside HTTPS, preventing ISPs and networks from snooping on browsing activity
- A single DNS lookup failure can cascade into a complete outage — DNS is the most critical single point of failure on the internet
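The single-packet UDP query mentioned above is small enough to build by hand — a sketch of the wire format (the transaction ID here is arbitrary):

```python
import struct

def encode_qname(name: str) -> bytes:
    """DNS name encoding: each label prefixed by its length, ending in 0x00."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_query(name: str, txid: int = 0x1234) -> bytes:
    """Minimal DNS query: 12-byte header + one question (type A, class IN)."""
    header = struct.pack("!HHHHHH",
                         txid,    # transaction ID, echoed in the response
                         0x0100,  # flags: standard query, recursion desired
                         1,       # QDCOUNT: one question
                         0, 0, 0) # no answer/authority/additional records
    return header + encode_qname(name) + struct.pack("!HH", 1, 1)  # A, IN
```

Sending these bytes over UDP to port 53 of any resolver returns a response that reuses the same header layout — which is also why predictable transaction IDs once enabled cache poisoning.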
Common Mistakes with DNS Protocol Deep Dive
- Setting TTLs too high before a migration — there is no way to force clients to drop cached records, so lower TTLs BEFORE the change
- Not understanding CNAME flattening — CNAMEs at the zone apex (example.com) violate RFC 1034, so providers that allow them flatten the CNAME into A/AAAA records at query time
- Forgetting that DNS propagation isn't instant — different resolvers cache records for different durations based on TTL
- Using A records when CNAME would be better — A records hardcode IPs, CNAMEs follow name changes automatically
- Not monitoring DNS resolution time — slow DNS adds latency to every single request end users make
Tools for DNS Protocol Deep Dive
- Cloudflare DNS (1.1.1.1) (Managed): Fastest public recursive resolver with built-in privacy and malware filtering — Scale: Global anycast, sub-15ms median response time
- AWS Route 53 (Managed): Authoritative DNS integrated with AWS ecosystem, health checks, and traffic routing — Scale: Enterprise DNS with 100% SLA
- BIND (Open Source): The reference DNS implementation — full-featured authoritative and recursive server — Scale: Runs root servers and large ISP resolvers
- CoreDNS (Open Source): Kubernetes-native DNS with plugin architecture for service discovery — Scale: Cloud-native clusters
Related to DNS Protocol Deep Dive
Life of a Packet, SMTP & Email Protocols, CDN & Edge Networking, HTTP/1.1 — The Foundation, TCP Deep Dive, UDP — When Speed Beats Safety, DHCP Protocol
DNS Security & DNSSEC — Security & Encryption
Difficulty: Advanced
Key Points for DNS Security & DNSSEC
- DNSSEC adds cryptographic authentication to DNS responses but does NOT encrypt them — it proves the answer is genuine, not that it is private
- The chain of trust flows from root (.) → TLD (.com) → zone (example.com) using DS records at each delegation point
- DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) encrypt queries for privacy but do not authenticate responses — DNSSEC and DoH/DoT solve orthogonal problems
- The Kaminsky attack (2008) demonstrated that DNS cache poisoning could be performed in seconds by exploiting predictable transaction IDs and source ports
- DNSSEC adoption remains below 30% of domains despite being standardized in 2005 — key management complexity and DNSSEC-induced outages are the primary barriers
Common Mistakes with DNS Security & DNSSEC
- Confusing DNSSEC with DoH/DoT. DNSSEC authenticates responses (integrity). DoH/DoT encrypt queries (privacy). They are complementary, not alternatives.
- Letting DNSSEC signatures expire. RRSIG records have explicit expiration dates — if the zone is not re-signed on schedule, resolvers reject every response as BOGUS.
- Not monitoring DNSSEC validation failures. When a resolver marks a domain as BOGUS, clients get SERVFAIL — indistinguishable from a total DNS outage without specific monitoring.
- Deploying DNSSEC without a key rollover plan. KSK rollovers require updating the DS record in the parent zone — botching this breaks the entire chain of trust.
- Assuming DNSSEC protects the last mile. DNSSEC validates the path from authoritative server to resolver, but the hop from resolver to client is unprotected without DoH/DoT.
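Monitoring the signature-expiry failure mode above starts with parsing RRSIG timestamps — a sketch assuming a 7-day re-sign margin (an operational choice, not a protocol constant):

```python
from datetime import datetime, timezone

def rrsig_days_left(expiration: str, now: datetime) -> int:
    """RRSIG expiration is encoded as YYYYMMDDHHMMSS in UTC."""
    exp = datetime.strptime(expiration, "%Y%m%d%H%M%S").replace(
        tzinfo=timezone.utc)
    return (exp - now).days

def resign_needed(expiration: str, now: datetime,
                  threshold_days: int = 7) -> bool:
    """Alert while there is still time to re-sign — after expiry,
    validating resolvers return SERVFAIL for the whole zone."""
    return rrsig_days_left(expiration, now) < threshold_days
```

Most signers automate re-signing, but the monitoring belongs outside the signer: the failure mode is precisely that the automation silently stopped.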
Tools for DNS Security & DNSSEC
- Cloudflare DNS (1.1.1.1) (Managed): Fastest public resolver with automatic DNSSEC validation, DoH, and DoT support out of the box — Scale: Global anycast, sub-15ms median
- Google Public DNS (8.8.8.8) (Managed): Widely trusted public resolver with full DNSSEC validation and both DoH and DoT endpoints — Scale: Global, handles trillions of queries
- Unbound (Open Source): Lightweight validating recursive resolver ideal for local DNSSEC validation and privacy-focused setups — Scale: Small to Enterprise
- BIND (Open Source): Full-featured authoritative and recursive server with comprehensive DNSSEC signing and validation support — Scale: Runs root servers and large ISPs
Related to DNS Security & DNSSEC
DNS Protocol Deep Dive, TLS Handshake — Step by Step, Certificates & PKI, Zero Trust Networking, DDoS & Rate Limiting
eBPF for Networking — Modern Patterns
Difficulty: Advanced
Key Points for eBPF for Networking
- eBPF runs inside the kernel but is safely sandboxed — the verifier guarantees programs cannot crash the kernel, access arbitrary memory, or enter infinite loops.
- XDP processes packets at the NIC driver level, achieving 10M+ packets/sec on a single core — 5-10x faster than iptables for the same workload.
- Cilium uses eBPF to replace kube-proxy entirely, implementing Kubernetes service load balancing without any iptables rules — critical when clusters have 10,000+ services.
- eBPF programs are JIT-compiled to native machine code, running at near-native speed with no interpreter overhead.
- Unlike kernel modules, eBPF programs can be loaded and updated without rebooting or recompiling the kernel, enabling live network policy changes.
Common Mistakes with eBPF for Networking
- Writing eBPF programs that exceed the verifier's complexity limit (1 million instructions). The verifier rejects overly complex programs to guarantee safety.
- Assuming eBPF works on all kernels. Linux 4.15+ is required for basic networking, 5.10+ for full features. This rules out older RHEL/CentOS 7 systems.
- Using eBPF maps without proper locking or per-CPU variants, causing contention under high concurrency that negates performance gains.
- Not accounting for the limited stack space (512 bytes) in eBPF programs. Complex packet parsing needs tail calls or helper functions, not deep recursion.
- Deploying Cilium without understanding that it replaces kube-proxy — existing iptables-based services and NetworkPolicies behave differently.
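The kernel-version pitfall above reduces to a version comparison — a sketch assuming `uname -r`-style strings; the exact feature gates vary by eBPF program type:

```python
def kernel_supports(version: str, minimum: tuple[int, int]) -> bool:
    """Compare a `uname -r`-style version string against (major, minor)."""
    # Strip distro suffixes like "5.10.0-23-amd64".
    core = version.split("-")[0]
    major, minor = (int(x) for x in core.split(".")[:2])
    return (major, minor) >= minimum

BASIC_EBPF = (4, 15)     # rough floor for eBPF networking, per the note above
FULL_FEATURES = (5, 10)  # rough floor for the full feature set
```

Tools like Cilium run an equivalent probe at startup; the point is to check before rollout rather than discover it on a CentOS 7 node in production.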
Tools for eBPF for Networking
- Cilium (Open Source): Kubernetes CNI and service mesh using eBPF for networking, security, and observability without sidecars — Scale: Medium-Enterprise
- Calico eBPF (Open Source): eBPF data plane for Calico's existing network policy engine, good migration path from iptables-based Calico — Scale: Medium-Enterprise
- Katran (Open Source): Meta's XDP-based L4 load balancer handling millions of connections per second at the network edge — Scale: Enterprise
- Falco (Open Source): Runtime security monitoring using eBPF to detect anomalous network connections and syscalls in containers — Scale: Medium-Enterprise
Related to eBPF for Networking
Life of a Packet, OSI Model — The Real Version, TCP Deep Dive, Service Mesh Networking, Network Observability, DDoS & Rate Limiting
gRPC & Protocol Buffers — Application Protocols
Difficulty: Intermediate
Key Points for gRPC & Protocol Buffers
- Protobuf is 3-10x smaller and 20-100x faster to parse than JSON, making gRPC ideal for high-throughput internal communication
- gRPC supports four streaming modes: unary, server streaming, client streaming, and bidirectional streaming
- Deadlines propagate across services — if Service A gives Service B a 5s deadline, B's call to C carries the remaining time
- Code generation from .proto files ensures client and server always agree on the contract — no runtime surprises
- gRPC reflection allows runtime schema discovery, enabling tools like grpcurl to work without compiled protos
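Deadline propagation from the points above, sketched as an absolute expiry passed down the call chain instead of a fresh per-hop timeout (gRPC libraries do this via metadata; the class here is illustrative):

```python
class Deadline:
    """Absolute deadline shared across hops: each service forwards the
    remaining budget downstream rather than setting a new timeout."""

    def __init__(self, now: float, timeout_s: float):
        self.expires_at = now + timeout_s

    def remaining(self, now: float) -> float:
        return max(0.0, self.expires_at - now)

    def expired(self, now: float) -> bool:
        return self.remaining(now) <= 0.0

# Service A gives B a 5s deadline; after 2s of work in B,
# B's call to C carries the remaining 3s, not a fresh 5s.
```

Without propagation, three services each applying their own 5s timeout can keep a request alive for 15s after the caller has already given up.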
Common Mistakes with gRPC & Protocol Buffers
- Not setting deadlines — without them, a slow downstream service can hold connections open indefinitely
- Using gRPC for browser-facing APIs without gRPC-Web — browsers cannot make native HTTP/2 gRPC calls
- Breaking backward compatibility by changing proto field numbers instead of adding new fields
- Ignoring error details — gRPC status codes are richer than HTTP status codes, use google.rpc.Status for structured errors
- Sending large payloads (>4MB default) without increasing max message size or using streaming instead
Tools for gRPC & Protocol Buffers
- gRPC (Open Source): High-performance, strongly-typed service-to-service communication with streaming — Scale: Google-scale internal infrastructure
- Apache Thrift (Open Source): Cross-language RPC with multiple transport and protocol options — Scale: Facebook's internal services
- Apache Avro RPC (Open Source): Schema-evolution-friendly RPC, especially in data pipeline ecosystems — Scale: Hadoop and Kafka ecosystems
- Cap'n Proto (Open Source): Zero-copy serialization for maximum performance in latency-sensitive paths — Scale: Cloudflare Workers, specialized high-perf systems
Related to gRPC & Protocol Buffers
HTTP/2 — Multiplexing Revolution, REST vs GraphQL vs gRPC, Connection Pooling & Keep-Alive, TLS Handshake — Step by Step, Service Mesh Networking, API Gateway vs Load Balancer vs Reverse Proxy
Head-of-Line Blocking — Performance & Observability
Difficulty: Advanced
Key Points for Head-of-Line Blocking
- Head-of-line blocking is when a stalled item at the front of a queue prevents everything behind it from progressing — it exists at multiple protocol layers
- HTTP/1.1 has request-level HOL blocking: one slow response blocks all subsequent requests on that connection, forcing browsers to open 6 parallel connections
- HTTP/2 solved HTTP-level HOL blocking with multiplexing but introduced TCP-level HOL blocking: one lost TCP segment blocks all multiplexed streams
- HTTP/3 with QUIC eliminates HOL blocking at both levels by giving each stream independent loss recovery over UDP
- The impact of HOL blocking increases dramatically with packet loss — at 2% loss, HTTP/2 can be slower than HTTP/1.1 with 6 connections
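The HTTP/1.1 blocking behavior above can be quantified with a toy model — one slow response dominates a single serial connection but not a set of parallel ones (this ignores connection setup cost and scheduling details):

```python
def http1_serial_time(response_times: list[float]) -> float:
    """One HTTP/1.1 connection: each response blocks those behind it."""
    return sum(response_times)

def http1_parallel_time(response_times: list[float],
                        connections: int = 6) -> float:
    """Browser workaround: spread requests over N connections, each
    request going to whichever connection frees up first."""
    lanes = [0.0] * connections
    for t in response_times:
        i = lanes.index(min(lanes))
        lanes[i] += t
    return max(lanes)
```

The same shape explains HTTP/2 under loss: multiplexing collapses the lanes back to one TCP stream, so a single retransmission stalls everything, exactly like the serial case.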
Common Mistakes with Head-of-Line Blocking
- Assuming HTTP/2 is always faster than HTTP/1.1. On lossy networks (mobile, satellite), TCP-level HOL blocking can make HTTP/2 slower than HTTP/1.1 with parallel connections
- Not understanding that HTTP/2's multiplexing doesn't magically eliminate all blocking — it trades HTTP-level blocking for TCP-level blocking
- Ignoring the role of packet loss in performance analysis. HOL blocking is invisible at 0% loss and devastating at 2%+ loss
- Sharding resources across multiple domains (domain sharding) on HTTP/2. This was an HTTP/1.1 workaround that hurts HTTP/2 by preventing multiplexing
- Assuming QUIC's independent streams have zero cost — QUIC adds per-stream overhead and can be less efficient than TCP for ordered data like database queries
Tools for Head-of-Line Blocking
- Chrome DevTools (Network tab) (Open Source): Visualizing request waterfall and identifying blocked requests with the 'Stalled' timing indicator — Scale: Development
- WebPageTest (Open Source): Comparing HTTP/1.1 vs HTTP/2 vs HTTP/3 waterfalls across different network conditions and locations — Scale: Development-Production
- Wireshark (Open Source): Deep packet analysis of TCP retransmissions, QUIC streams, and protocol-level blocking events — Scale: Any
- h2load (Open Source): HTTP/2 and HTTP/3 load testing and benchmarking to measure real-world multiplexing performance — Scale: Development
Related to Head-of-Line Blocking
TCP Deep Dive, TCP Congestion Control, HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, HTTP/3 — UDP Takes Over, QUIC Protocol, Network Latency — Where Time Goes, Connection Pooling & Keep-Alive
HTTP/1.1 — The Foundation — Application Protocols
Difficulty: Beginner
Key Points for HTTP/1.1 — The Foundation
- HTTP/1.1 defaults to persistent connections (keep-alive) — closing is the exception, not the rule
- Pipelining was specified but never reliably implemented due to head-of-line blocking at the response level
- Chunked transfer encoding allows servers to stream responses of unknown length
- Conditional requests (If-None-Match, If-Modified-Since) save bandwidth by returning 304 Not Modified
- The Host header is mandatory in HTTP/1.1, enabling virtual hosting on a single IP address
Common Mistakes with HTTP/1.1 — The Foundation
- Ignoring Cache-Control and relying solely on ETag — they serve different purposes and work best together
- Assuming HTTP pipelining works in practice — most browsers disabled it due to buggy proxies
- Not setting Content-Length or using chunked encoding, causing clients to hang waiting for data
- Confusing 401 Unauthorized (needs authentication) with 403 Forbidden (authenticated but not authorized)
- Opening too many parallel connections to the same origin, triggering server-side rate limits
Tools for HTTP/1.1 — The Foundation
- curl (Open Source): CLI-based HTTP debugging and scripting — Scale: Single requests to millions via scripting
- Postman (Commercial): Interactive API exploration and team collaboration — Scale: Individual developer to large teams
- Apache HTTP Server (Open Source): Traditional web serving with rich module ecosystem — Scale: Small sites to enterprise deployments
- Nginx (Open Source): High-performance reverse proxy and static file serving — Scale: Thousands to millions of concurrent connections
Related to HTTP/1.1 — The Foundation
HTTP/2 — Multiplexing Revolution, HTTP/3 — UDP Takes Over, REST vs GraphQL vs gRPC, TLS Handshake — Step by Step, Head-of-Line Blocking, Connection Pooling & Keep-Alive, TCP Deep Dive
HTTP/2 — Multiplexing Revolution — Application Protocols
Difficulty: Intermediate
Key Points for HTTP/2 — Multiplexing Revolution
- HTTP/2 multiplexes all requests over a single TCP connection, eliminating the need for domain sharding
- HPACK header compression reduces header overhead by 85-90% compared to HTTP/1.1's repeated text headers
- Server push sounded great in theory, but Chrome has removed it and other browsers have deprecated it due to poor real-world performance
- Stream prioritization lets clients hint which resources matter most, but server implementations vary wildly
- TCP-level head-of-line blocking still exists — a single lost packet blocks ALL streams on the connection
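The HPACK savings claim above can be made concrete with a back-of-envelope byte count. This is an illustrative model, not real HPACK: it ignores Huffman coding and the static table, and just contrasts resending full text headers per request versus a roughly one-byte table index after the first occurrence.

```python
# Illustrative model of why HPACK helps: HTTP/1.1 resends every header in
# full on every request, while HPACK can replace a previously seen header
# with a ~1-byte dynamic-table index. Real HPACK also Huffman-codes strings.
def h1_bytes(headers, requests):
    return requests * sum(len(k) + len(v) + 4 for k, v in headers)  # "k: v\r\n"

def hpack_bytes(headers, requests):
    first = sum(len(k) + len(v) + 2 for k, v in headers)   # literal, indexed
    return first + (requests - 1) * len(headers)           # 1 byte per index

headers = [("user-agent", "Mozilla/5.0 (X11; Linux x86_64) ..."),
           ("accept", "text/html,application/xhtml+xml"),
           ("cookie", "session=abc123; theme=dark")]
print(h1_bytes(headers, 100), hpack_bytes(headers, 100))
```

Even this crude model shows an order-of-magnitude reduction over 100 requests, in line with the 85-90% figure.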
Common Mistakes with HTTP/2 — Multiplexing Revolution
- Still using domain sharding with HTTP/2 — this hurts performance by splitting the single-connection advantage
- Assuming server push will speed up page loads — in practice it often pushes resources the client already has cached
- Not enabling HTTP/2 on the backend — many teams only enable it at the CDN edge, missing internal benefits
- Ignoring stream priorities — unoptimized servers treat all streams equally, defeating the purpose
- Thinking HTTP/2 requires TLS — the spec allows plaintext (h2c), though browsers mandate TLS in practice
Tools for HTTP/2 — Multiplexing Revolution
- Nginx (Open Source): HTTP/2 termination and reverse proxying with battle-tested performance — Scale: Millions of concurrent connections
- Envoy Proxy (Open Source): HTTP/2 in service mesh environments with advanced observability — Scale: Cloud-native microservice architectures
- Cloudflare (Managed): Automatic HTTP/2 at the edge with zero server-side config — Scale: Global CDN scale
- HAProxy (Open Source): High-performance HTTP/2 load balancing with fine-grained control — Scale: Enterprise load balancing
Related to HTTP/2 — Multiplexing Revolution
HTTP/1.1 — The Foundation, HTTP/3 — UDP Takes Over, gRPC & Protocol Buffers, Head-of-Line Blocking, TCP Deep Dive, Connection Pooling & Keep-Alive, TLS Handshake — Step by Step
HTTP/3 — UDP Takes Over — Application Protocols
Difficulty: Advanced
Key Points for HTTP/3 — UDP Takes Over
- HTTP/3 eliminates TCP head-of-line blocking by running each stream as an independent QUIC stream over UDP
- 0-RTT resumption lets returning clients send data immediately, shaving an entire round trip off connection setup
- Connection migration means mobile users switching from WiFi to cellular don't lose their HTTP connection
- QUIC integrates TLS 1.3 directly into the transport layer — encryption isn't optional, it's structural
- QPACK replaces HPACK to avoid the head-of-line blocking that HPACK's dynamic table caused across streams
Common Mistakes with HTTP/3 — UDP Takes Over
- Assuming HTTP/3 is just HTTP/2 over UDP — QUIC is a complete transport protocol, not a thin wrapper
- Blocking UDP at the firewall and wondering why HTTP/3 doesn't work — many corporate networks block UDP
- Using 0-RTT without understanding replay attacks — 0-RTT data can be replayed by an attacker
- Not implementing fallback to HTTP/2 — some networks and middleboxes still don't support QUIC
- Expecting HTTP/3 to be faster in all cases — on reliable networks with low latency, the difference is minimal
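Fallback starts with discovery: servers advertise HTTP/3 via the Alt-Svc response header (e.g. `h3=":443"; ma=86400`). A simplified parser sketch, assuming well-formed input and ignoring parameters like `ma`:

```python
# Hedged sketch: extract the protocols offered in an Alt-Svc header so a
# client can attempt QUIC and fall back to HTTP/2 if UDP is blocked.
# Simplified: ignores parameters (ma, persist) and quoting edge cases.
def parse_alt_svc(header):
    offers = {}
    for entry in header.split(","):
        proto, _, rest = entry.strip().partition("=")
        authority = rest.split(";")[0].strip('"')
        offers[proto] = authority
    return offers

print(parse_alt_svc('h3=":443"; ma=86400, h2=":443"'))
# {'h3': ':443', 'h2': ':443'}
```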
Tools for HTTP/3 — UDP Takes Over
- Cloudflare (Managed): Automatic HTTP/3 with QUIC at global edge — flip a switch — Scale: Global CDN, millions of sites
- quiche (Cloudflare) (Open Source): Production QUIC implementation in Rust for custom integrations — Scale: Embedded in Cloudflare and Nginx
- msquic (Microsoft) (Open Source): Cross-platform QUIC library for Windows and Linux applications — Scale: Used in Windows, .NET, and Xbox
- Nginx HTTP/3 (Open Source): Native HTTP/3/QUIC support, merged into mainline Nginx (1.25+) after incubating in the nginx-quic branch — Scale: Production web serving
Related to HTTP/3 — UDP Takes Over
HTTP/2 — Multiplexing Revolution, HTTP/1.1 — The Foundation, QUIC Protocol, UDP — When Speed Beats Safety, Head-of-Line Blocking, TLS Handshake — Step by Step, TCP Deep Dive, CDN & Edge Networking
HTTP Caching Deep Dive — Application Protocols
Difficulty: Intermediate
Key Points for HTTP Caching Deep Dive
- Cache-Control: no-cache does NOT mean 'do not cache.' It means 'cache the response but always revalidate with the origin before using it.' The directive that prevents storing is no-store.
- stale-while-revalidate allows the cache to serve a stale response immediately while fetching a fresh copy in the background — eliminating latency spikes during revalidation.
- The immutable directive tells the browser to never revalidate a resource, even on a hard reload. This is safe for fingerprinted assets like /app.a1b2c3.js and eliminates wasted conditional requests.
- s-maxage overrides max-age for shared caches (CDNs and proxies) without affecting the browser cache. This separates the CDN TTL from the browser TTL.
- The Vary header is a cache key modifier. Setting Vary: Accept-Encoding means the cache stores separate gzip and brotli versions. Setting Vary: Cookie effectively disables shared caching because every user's cookie differs.
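The directive semantics above can be sketched as a freshness decision. This is a simplified model of the Cache-Control logic, not a full RFC-compliant parser:

```python
# Sketch of the freshness decision from the key points above: no-store means
# never store, no-cache means store-but-revalidate, and s-maxage overrides
# max-age for shared caches only. Parsing here is deliberately simplified.
def freshness_ttl(cache_control, shared_cache):
    d = {}
    for part in cache_control.split(","):
        k, _, v = part.strip().partition("=")
        d[k] = int(v) if v else True
    if "no-store" in d:
        return None           # may not be stored at all
    if "no-cache" in d:
        return 0              # stored, but revalidate on every use
    if shared_cache and "s-maxage" in d:
        return d["s-maxage"]
    return d.get("max-age", 0)

print(freshness_ttl("max-age=60, s-maxage=3600", shared_cache=True))   # 3600
print(freshness_ttl("max-age=60, s-maxage=3600", shared_cache=False))  # 60
```

The same header yields a one-hour CDN TTL and a one-minute browser TTL, which is exactly the CDN/browser separation s-maxage exists for.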
Common Mistakes with HTTP Caching Deep Dive
- Confusing no-cache with no-store. Setting no-cache still caches the response — it just forces revalidation. Sensitive data (account pages, payment info) needs no-store.
- Serving static assets with short max-age instead of using fingerprinted filenames with immutable. Every deployment causes millions of conditional requests that all return 304.
- Setting Vary: Cookie on CDN-cached content, which creates a unique cache entry per user and makes the CDN effectively useless — a 0% hit rate.
- Not setting Cache-Control at all. Without explicit directives, browser heuristics apply — typically caching for 10% of the age since Last-Modified, which is unpredictable.
- Purging CDN caches by URL pattern without realizing that query parameters, Vary headers, and content negotiation create multiple cache entries per URL.
Tools for HTTP Caching Deep Dive
- Varnish (Open Source): High-performance HTTP reverse proxy cache with VCL scripting for custom cache logic — handles millions of req/sec in front of origin servers — Scale: Medium-Enterprise
- Nginx proxy_cache (Open Source): Built-in caching for Nginx reverse proxy, simple configuration, good for single-origin setups without complex invalidation needs — Scale: Small-Enterprise
- Cloudflare (Commercial): Global CDN with automatic caching, Tiered Cache to reduce origin load, and Cache Rules for fine-grained control without touching origin headers — Scale: Small-Enterprise
- Squid (Open Source): Forward proxy cache for corporate networks and ISPs — caches outbound traffic to reduce bandwidth consumption — Scale: Small-Enterprise
Related to HTTP Caching Deep Dive
HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, CDN & Edge Networking, Network Latency — Where Time Goes, API Gateway vs Load Balancer vs Reverse Proxy
IP Addressing & Subnetting — Foundations & Data Travel
Difficulty: Intermediate
Key Points for IP Addressing & Subnetting
- CIDR replaced classful addressing (Class A/B/C) in 1993. If someone talks about IP classes in a production context, they are 30 years behind.
- Three private ranges: 10.0.0.0/8 (16M addresses), 172.16.0.0/12 (1M addresses), 192.168.0.0/16 (65K addresses).
- IPv4 addresses (4.3 billion) are exhausted. NAT and CIDR are the duct tape keeping IPv4 alive. IPv6 has 340 undecillion addresses.
- In cloud VPCs, subnet sizing is the first architecture decision and the hardest to change later.
- Always plan subnets with room to grow. A /24 gives 254 hosts — that feels huge until Kubernetes with 30 pods per node eats through them.
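The subnet math above is easy to check with Python's standard library. The "minus 5" figure mirrors AWS, which reserves five addresses per subnet (network, router, DNS, future use, broadcast); the CIDRs are examples.

```python
import ipaddress

# Quick subnet arithmetic with the stdlib ipaddress module.
def aws_usable_hosts(cidr):
    # AWS reserves 5 addresses per subnet, so usable = total - 5
    return ipaddress.ip_network(cidr).num_addresses - 5

print(aws_usable_hosts("10.0.1.0/24"))  # 251
print(aws_usable_hosts("10.0.1.0/28"))  # 11

# Carving a /16 VPC into /24 subnets:
vpc = ipaddress.ip_network("10.0.0.0/16")
print(sum(1 for _ in vpc.subnets(new_prefix=24)))  # 256
```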
Common Mistakes with IP Addressing & Subnetting
- Making VPC subnets too small. A /28 gives only 11 usable IPs on AWS (16 minus 5 reserved). Kubernetes clusters burn through IPs fast.
- Using overlapping CIDR ranges across VPCs. This makes VPC peering impossible without ugly NAT workarounds.
- Forgetting that AWS, GCP, and Azure each reserve 3-5 IPs per subnet for infrastructure (gateway, DNS, broadcast, etc.).
- Treating IPv6 as optional. Major cloud providers and mobile carriers now use IPv6 by default — services need to handle it.
- Not documenting the IP address plan. Six months later nobody remembers which /16 was assigned to production vs staging.
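The overlapping-CIDR mistake above is cheap to catch before peering. A sketch with hypothetical VPC ranges:

```python
import ipaddress

# Check every VPC pair for overlap before attempting peering.
# The names and CIDRs are hypothetical examples.
vpcs = {
    "prod":    ipaddress.ip_network("10.0.0.0/16"),
    "staging": ipaddress.ip_network("10.1.0.0/16"),
    "legacy":  ipaddress.ip_network("10.0.128.0/17"),  # sits inside prod!
}
conflicts = [
    (a, b)
    for a in vpcs for b in vpcs
    if a < b and vpcs[a].overlaps(vpcs[b])
]
print(conflicts)  # [('legacy', 'prod')]
```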
Tools for IP Addressing & Subnetting
- AWS VPC (Managed): Cloud-native subnetting with integrated routing, security groups, and NACLs — Scale: Small-Enterprise
- GCP VPC (Managed): Global VPCs with automatic subnet creation per region — Scale: Medium-Enterprise
- Azure VNet (Managed): Enterprise hybrid cloud networking with ExpressRoute integration — Scale: Large-Enterprise
- ipcalc (Open Source): CLI tool for quick subnet calculation, CIDR math, and range validation — Scale: Small-Enterprise
Related to IP Addressing & Subnetting
NAT — Network Address Translation, DHCP Protocol, Routing & BGP Basics, OSI Model — The Real Version, VPN & Tunneling, Zero Trust Networking
IPv6 Deep Dive — Foundations & Data Travel
Difficulty: Intermediate
Key Points for IPv6 Deep Dive
- IPv4 address exhaustion is not hypothetical — IANA allocated the last /8 blocks in 2011, and all five RIRs have hit their final allocations
- IPv6 eliminates NAT as an architectural necessity — every device gets a globally routable address, restoring the end-to-end principle
- SLAAC enables truly zero-configuration networking: plug in a cable, receive a router advertisement, generate an address, and reach the internet
- The IPv6 header is simpler than IPv4 (40 bytes fixed, no checksum, no fragmentation by routers) — processing is faster at line rate
- Dual-stack is the dominant transition strategy — run both IPv4 and IPv6 simultaneously and let applications choose based on DNS responses
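The allocation sizes above translate into numbers that are hard to grasp without computing them, using the IPv6 documentation prefix as an example:

```python
import ipaddress

# /48-per-site and /64-per-subnet math from the key points above,
# using the reserved documentation prefix 2001:db8::/32.
site = ipaddress.ip_network("2001:db8::/48")
subnet = ipaddress.ip_network("2001:db8::/64")

subnets_per_site = site.num_addresses // subnet.num_addresses
print(subnets_per_site)      # 65536 /64 subnets in one /48 site allocation
print(subnet.num_addresses)  # 2**64 addresses in every single /64
```

A single /64 subnet holds 2^64 addresses, the entire IPv4 address space squared, which is why subnet-pinching IPv6 is pointless.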
Common Mistakes with IPv6 Deep Dive
- Assuming IPv6 deployment can wait. Major mobile carriers (T-Mobile, Reliance Jio) are IPv6-only with NAT64 — applications that break on IPv6 are already losing users.
- Treating IPv6 as 'long IPv4' and trying to map IPv4 subnetting practices directly. IPv6 allocations are /48 per site and /64 per subnet — there is no reason to subnet-pinch.
- Forgetting that IPv6 has no broadcast. Multicast and anycast replace broadcast use cases — code that depends on broadcast (ARP, DHCP discover) must be reworked for NDP and DHCPv6.
- Disabling IPv6 on servers 'for security' without understanding the attack surface. This breaks SLAAC and NDP, and a half-disabled IPv6 stack often causes DNS resolution delays from AAAA query timeouts.
- Not testing NAT64/DNS64 compatibility. Applications that embed literal IPv4 addresses in payloads (SIP, FTP, game protocols) break silently behind NAT64 gateways.
Tools for IPv6 Deep Dive
- AWS VPC Dual-Stack (Managed): Native IPv6 support in VPCs with dual-stack subnets, ELBs, and egress-only internet gateways for IPv6 — Scale: Enterprise cloud
- GCP (Managed): Dual-stack VPCs with IPv6 support on load balancers, GKE pods, and Cloud DNS AAAA records — Scale: Enterprise cloud
- Hurricane Electric (he.net) (Free): IPv6 tunnel broker for networks without native IPv6 — provides a /48 allocation over a 6in4 tunnel — Scale: Individual to small enterprise
- Jool (NAT64) (Open Source): High-performance stateful NAT64 implementation for Linux, enabling IPv6-only networks to reach IPv4 servers — Scale: ISP and enterprise
Related to IPv6 Deep Dive
IP Addressing & Subnetting, NAT — Network Address Translation, DHCP Protocol, Routing & BGP Basics, Life of a Packet
Life of a Packet — Foundations & Data Travel
Difficulty: Beginner
Key Points for Life of a Packet
- A single HTTPS page load involves at least 4 different protocols working in sequence: DNS, TCP, TLS, HTTP.
- The first request to a new host is the most expensive — DNS lookup, TCP handshake, and TLS handshake all add latency.
- Subsequent requests on the same connection skip DNS (cached), TCP (keep-alive), and TLS (session resumption).
- A packet crosses 10-20 router hops on average to travel across the internet, each adding microseconds to milliseconds.
- Understanding this sequence is the foundation for optimizing web performance — every millisecond saved in the chain compounds.
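The cold-versus-warm difference above can be sketched as a round-trip budget. All numbers are illustrative assumptions: a ~75 ms transatlantic RTT and a 50 ms cold DNS lookup.

```python
# Hedged back-of-envelope model of the request sequence above.
RTT, DNS_COLD = 75, 50   # illustrative milliseconds

def first_request_ms(tls_version="1.3"):
    tcp = RTT                                  # 3-way handshake: 1 RTT
    tls = RTT if tls_version == "1.3" else 2 * RTT
    http = RTT                                 # request out, first byte back
    return DNS_COLD + tcp + tls + http

def reused_connection_ms():
    return RTT    # DNS cached, TCP kept alive, TLS session resumed

print(first_request_ms("1.3"))   # 275
print(first_request_ms("1.2"))   # 350
print(reused_connection_ms())    # 75
```

The warm path is more than three times cheaper, which is why connection reuse dominates web performance work.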
Common Mistakes with Life of a Packet
- Assuming DNS is instant. A cold DNS lookup can take 20-120ms, and it happens before anything else.
- Forgetting that TLS adds round trips. TLS 1.2 adds 2 round trips; TLS 1.3 reduces this to 1, but it is still not free.
- Not reusing TCP connections. Each new connection costs a 3-way handshake — use HTTP keep-alive or connection pooling.
- Ignoring the return path. Packets can take different routes in each direction (asymmetric routing), causing confusing latency patterns.
- Blaming the server when the real bottleneck is the network. Use traceroute and packet captures before diving into application code.
Tools for Life of a Packet
- tcpdump (Open Source): Capturing the full packet lifecycle on a server with minimal overhead — Scale: Small-Enterprise
- Wireshark (Open Source): Visual analysis of the complete request sequence with timing breakdowns — Scale: Small-Enterprise
- Chrome DevTools (Network tab) (Open Source): Browser-side timing breakdown: DNS, TCP, TLS, TTFB, content download — Scale: Small-Enterprise
- mtr (My Traceroute) (Open Source): Combining ping and traceroute to show packet loss and latency at each hop — Scale: Small-Enterprise
Related to Life of a Packet
OSI Model — The Real Version, TCP Deep Dive, TLS Handshake — Step by Step, DNS Protocol Deep Dive, HTTP/1.1 — The Foundation, Network Latency — Where Time Goes, QUIC Protocol
Load Balancing Algorithms — Performance & Observability
Difficulty: Intermediate
Key Points for Load Balancing Algorithms
- Round robin is the simplest algorithm and works well when all backends are identical and requests are roughly equal cost — it fails when backends differ in capacity or requests vary in weight
- Consistent hashing minimizes cache disruption when backends are added or removed — only K/N keys move (K = total keys, N = backends) instead of rehashing everything
- Power of Two Choices (P2C) picks two random backends and routes to the one with fewer connections — this simple approach produces near-optimal distribution and is Envoy's default
- Maglev (Google) uses a lookup table that provides consistent hashing with near-uniform distribution and O(1) lookup time — designed for L4 load balancing at scale
- The wrong algorithm causes cascading failures: round robin during a partial outage keeps sending traffic to slow backends, turning a degradation into a full outage
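Power of Two Choices is simple enough to sketch in a few lines. This is a toy simulation, not Envoy's implementation, and it simplifies by never completing connections:

```python
import random

# Minimal Power of Two Choices sketch: sample two backends at random and
# route to the one with fewer in-flight connections.
def p2c_pick(connections, rng=random):
    a, b = rng.sample(range(len(connections)), 2)
    return a if connections[a] <= connections[b] else b

rng = random.Random(42)          # fixed seed for reproducibility
counts = [0] * 8
for _ in range(10_000):
    i = p2c_pick(counts, rng)
    counts[i] += 1               # simplification: connections never complete
print(max(counts) - min(counts))  # spread stays tiny despite random sampling
```

Two random samples plus one comparison is all it takes to keep the spread between the busiest and idlest backend to a handful of connections, versus hundreds for purely random assignment.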
Common Mistakes with Load Balancing Algorithms
- Using round robin with backends that have different CPU or memory capacities. A 2-core instance gets the same traffic as a 16-core instance, overloading the smaller one.
- Implementing consistent hashing without virtual nodes. With K backends and no virtual nodes, the hash space distribution is wildly uneven — some backends get 3x the traffic.
- Choosing least-connections for stateless HTTP APIs where all requests are equal cost. The connection tracking overhead provides no benefit — round robin is simpler and equivalent.
- Not implementing slow-start for new backends. A freshly started instance that receives its full share of traffic immediately may overwhelm cold caches and connection pools.
- Ignoring request cost variation. Least-connections assumes all connections are equal — a backend with 10 lightweight GETs appears busier than one with 2 heavy report-generation queries.
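The virtual-node fix from the mistakes above can be sketched as a hash ring. This is a teaching sketch (md5 for brevity, 100 virtual nodes per backend), not a production implementation:

```python
import bisect, hashlib

# Consistent hash ring with virtual nodes: each backend is hashed onto the
# ring many times so load spreads evenly, and adding a backend moves ~K/N keys.
def _h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(backends, vnodes=100):
    return sorted((_h(f"{b}#{v}"), b) for b in backends for v in range(vnodes))

def lookup(ring, key):
    i = bisect.bisect(ring, (_h(key), "")) % len(ring)  # next point clockwise
    return ring[i][1]

keys = [f"user:{n}" for n in range(1000)]
ring3 = build_ring(["app-1", "app-2", "app-3"])
ring4 = build_ring(["app-1", "app-2", "app-3", "app-4"])
moved = sum(1 for k in keys if lookup(ring3, k) != lookup(ring4, k))
print(moved)  # roughly 1000/4 keys move, not all 1000
```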
Tools for Load Balancing Algorithms
- HAProxy (Open Source): High-performance L4/L7 load balancing with round robin, least-connections, source hashing, and URI hashing built in — Scale: Millions of connections, single-node
- Envoy (Open Source): Service mesh sidecar with P2C, ring hash, Maglev, and zone-aware load balancing — the Istio and AWS App Mesh default — Scale: Cloud-native, per-pod sidecar
- NGINX (Open Source): L7 reverse proxy with round robin, least-connections, IP hash, and generic hash — the most deployed web server — Scale: Small to Enterprise
- AWS ALB (Managed): Managed L7 load balancing with least outstanding requests, round robin, and tight integration with ECS/EKS target groups — Scale: Enterprise cloud
Related to Load Balancing Algorithms
API Gateway vs Load Balancer vs Reverse Proxy, CDN & Edge Networking, Service Mesh Networking, Connection Pooling & Keep-Alive, Network Latency — Where Time Goes
Long Polling vs SSE vs WebSocket — Real-Time & Streaming
Difficulty: Intermediate
Key Points for Long Polling vs SSE vs WebSocket
- Long polling is the simplest to implement and works everywhere, but wastes resources on constant reconnection and cannot push data faster than the reconnect cycle.
- SSE is HTTP-native, auto-reconnects with Last-Event-ID, and works through proxies and CDNs — but is strictly one-way (server to client).
- WebSocket provides true bidirectional communication with minimal framing overhead, but requires special proxy configuration and has no built-in reconnection.
- For 90% of real-time needs (notifications, feeds, dashboards, AI streaming), SSE is the right choice. WebSocket is only necessary when the client sends frequent data too.
- HTTP/2 changes the equation significantly — SSE over HTTP/2 multiplexes perfectly, eliminating the connection-per-stream limitation that plagued SSE over HTTP/1.1.
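The SSE wire format that EventSource consumes is plain text: blocks of `field: value` lines separated by a blank line. A minimal parser sketch, handling only the `data` and `id` fields (real parsers also handle `event`, `retry`, and multi-line data):

```python
# Minimal parser for the SSE wire format consumed by EventSource.
def parse_sse(stream):
    events, data, last_id = [], [], None
    for line in stream.splitlines():
        if line.startswith("data:"):
            data.append(line[5:].strip())
        elif line.startswith("id:"):
            last_id = line[3:].strip()      # resent as Last-Event-ID on reconnect
        elif line == "" and data:
            events.append({"id": last_id, "data": "\n".join(data)})
            data = []
    return events

raw = "id: 41\ndata: hello\n\nid: 42\ndata: world\n\n"
print(parse_sse(raw))
# [{'id': '41', 'data': 'hello'}, {'id': '42', 'data': 'world'}]
```

The `id` field is what powers auto-reconnect: the browser resends the last seen id as the Last-Event-ID header so the server can resume the stream.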
Common Mistakes with Long Polling vs SSE vs WebSocket
- Defaulting to WebSocket for every real-time feature. Most use cases are server-push only, where SSE is simpler and more reliable.
- Implementing long polling without a timeout. The server must eventually respond (even with empty data) or proxies, load balancers, and browsers will kill the connection.
- Not handling WebSocket reconnection. Unlike SSE, WebSocket has no auto-reconnect. The application must implement retry logic, exponential backoff, and state recovery.
- Ignoring proxy and firewall compatibility. WebSocket requires proxy support for the Upgrade header. In corporate environments, this frequently fails silently.
- Using long polling when SSE is available. Long polling made sense in 2010 when IE did not support SSE. Today, EventSource is supported in all modern browsers.
Tools for Long Polling vs SSE vs WebSocket
- Socket.IO (Open Source): WebSocket with automatic fallback to long polling, rooms, namespaces, and reconnection built in — Scale: Small-Enterprise
- Native EventSource API (Open Source): Zero-dependency SSE consumption in browsers with automatic reconnection — Scale: Small-Enterprise
- SockJS (Open Source): WebSocket emulation with fallback transports for environments where WebSocket is blocked — Scale: Small-Medium
- Centrifugo (Open Source): Language-agnostic real-time messaging server supporting WebSocket, SSE, and HTTP streaming with pub/sub — Scale: Medium-Enterprise
Related to Long Polling vs SSE vs WebSocket
HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, WebSocket Protocol, Server-Sent Events (SSE), Connection Pooling & Keep-Alive, Head-of-Line Blocking, CORS — Cross-Origin Resource Sharing
MQTT & IoT Protocols — Real-Time & Streaming
Difficulty: Intermediate
Key Points for MQTT & IoT Protocols
- MQTT uses a publish/subscribe model where devices never communicate directly — the broker handles all routing, decoupling producers from consumers.
- The three QoS levels enable trading reliability for efficiency: QoS 0 for telemetry that can tolerate loss, QoS 2 for commands that must arrive exactly once.
- MQTT's overhead is tiny — a minimal packet is just 2 bytes. HTTP's minimum overhead is hundreds of bytes. This matters at scale with 10,000 battery-powered sensors.
- Retained messages solve the 'late joiner' problem — a new subscriber immediately gets the current state without waiting for the next publish cycle.
- Last Will and Testament (LWT) provides automatic offline detection. If a device loses connectivity, the broker publishes its pre-configured 'death' message.
Common Mistakes with MQTT & IoT Protocols
- Using QoS 2 for everything. The exactly-once four-packet handshake (PUBLISH → PUBREC → PUBREL → PUBCOMP) is expensive. Use QoS 0 for telemetry and QoS 1 for commands.
- Designing flat topic structures like device123-temperature. Use hierarchical topics (building/floor3/room301/temperature) to enable wildcard subscriptions.
- Publishing large payloads over MQTT. It supports up to 256MB messages, but it was designed for small sensor readings. For large data, use MQTT to signal availability and HTTP to download.
- Ignoring clean session semantics. With clean_session=false, the broker queues messages for offline clients. Thousands of offline devices with QoS 1 subscriptions can exhaust broker memory.
- Not setting up LWT messages. Without them, the system has no way to distinguish a device that has nothing to report from a device that has crashed.
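The hierarchical-topic advice above pays off through wildcard matching, which can be sketched in a few lines ('+' matches exactly one level, '#' matches all remaining levels):

```python
# Sketch of MQTT topic filter matching: '+' matches one level, '#' matches
# everything from that level down. Topic names here are examples.
def topic_matches(pattern, topic):
    p, t = pattern.split("/"), topic.split("/")
    for i, seg in enumerate(p):
        if seg == "#":
            return True
        if i >= len(t) or (seg != "+" and seg != t[i]):
            return False
    return len(p) == len(t)

print(topic_matches("building/+/+/temperature",
                    "building/floor3/room301/temperature"))  # True
print(topic_matches("building/floor3/#",
                    "building/floor3/room301/humidity"))     # True
print(topic_matches("building/+", "building/floor3/room301"))  # False
```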
Tools for MQTT & IoT Protocols
- Mosquitto (Open Source): Lightweight single-node MQTT broker, ideal for development and small deployments — Scale: Small-Medium
- HiveMQ (Commercial): Enterprise MQTT broker with clustering, monitoring dashboard, and Kafka bridge — Scale: Medium-Enterprise
- EMQX (Open Source): High-performance distributed MQTT broker handling millions of concurrent connections — Scale: Large-Enterprise
- AWS IoT Core (Managed): Fully managed MQTT broker integrated with AWS services (Lambda, DynamoDB, S3) — Scale: Medium-Enterprise
Related to MQTT & IoT Protocols
TCP Deep Dive, UDP — When Speed Beats Safety, WebSocket Protocol, Connection Pooling & Keep-Alive, TLS Handshake — Step by Step, Server-Sent Events (SSE)
mTLS — Mutual Authentication — Security & Encryption
Difficulty: Advanced
Key Points for mTLS — Mutual Authentication
- Standard TLS only authenticates the server. mTLS adds client authentication, creating a two-way identity verification.
- Service meshes like Istio and Linkerd automate mTLS transparently — application code never touches certificates.
- SPIFFE provides a standardized workload identity framework, and SPIRE is its production-grade implementation.
- Short-lived certificates (hours, not years) reduce the blast radius of key compromise and often eliminate the need for revocation.
- mTLS is the foundation of zero trust networking — every connection must prove identity, regardless of network location.
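What a mesh sidecar does under the hood can be sketched with Python's ssl module. The file paths are hypothetical placeholders; the essential line is setting `verify_mode` to `CERT_REQUIRED`, which is what makes the TLS mutual.

```python
import ssl

# Hedged sketch of an mTLS server context: present our own certificate AND
# require the client to present one signed by a CA we trust.
# cert_file/key_file/ca_bundle are hypothetical paths.
def make_mtls_server_context(cert_file, key_file, ca_bundle):
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.load_cert_chain(cert_file, key_file)   # our identity
    ctx.load_verify_locations(ca_bundle)       # trust anchors for clients
    ctx.verify_mode = ssl.CERT_REQUIRED        # this line makes TLS mutual
    return ctx
```

Without that last line the context behaves like standard TLS: the default server-side `verify_mode` is `CERT_NONE`, i.e. the client is never asked for a certificate.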
Common Mistakes with mTLS — Mutual Authentication
- Implementing mTLS at the application level instead of using a sidecar proxy or service mesh, creating massive maintenance burden.
- Using long-lived client certificates (years) that become impossible to rotate without coordinated downtime.
- Not validating the full certificate chain on both sides — just checking that a certificate exists is not enough.
- Forgetting to handle certificate rotation gracefully, causing connection drops when certs are renewed.
- Hardcoding trust anchors instead of loading them from a dynamic trust bundle that can be updated without redeployment.
Tools for mTLS — Mutual Authentication
- Istio (Open Source): Full service mesh with automatic mTLS, traffic management, and observability in Kubernetes — Scale: Enterprise
- Linkerd (Open Source): Lightweight service mesh focused on simplicity, automatic mTLS with minimal resource overhead — Scale: Small-Enterprise
- SPIRE (Open Source): Standalone workload identity and certificate issuance without requiring a full service mesh — Scale: Small-Enterprise
- HashiCorp Vault PKI (Open Source): Private CA with dynamic certificate issuance, fine-grained policies, and multi-cloud support — Scale: Enterprise
Related to mTLS — Mutual Authentication
TLS Handshake — Step by Step, Certificates & PKI, Zero Trust Networking, Service Mesh Networking, OAuth 2.0 & OIDC Flows
NAT — Network Address Translation — Foundations & Data Travel
Difficulty: Intermediate
Key Points for NAT — Network Address Translation
- NAT is the reason the internet still works on IPv4. Without it, we would have run out of addresses in the late 1990s.
- PAT (overloaded NAT) maps thousands of internal connections to a single public IP using different source ports.
- NAT breaks the end-to-end principle of IP — devices behind NAT cannot receive unsolicited inbound connections.
- NAT type (Full Cone, Restricted, Symmetric) determines whether P2P protocols like WebRTC can establish direct connections.
- Cloud NAT Gateways (AWS NAT GW, GCP Cloud NAT) cost real money — $0.045/hr + $0.045/GB on AWS. Optimize traffic to reduce costs.
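PAT, described above, is at heart a translation table. A toy sketch with example addresses (real NAT also tracks the destination, protocol, and timeouts):

```python
import itertools

# Toy PAT (port address translation) table: many internal (ip, port) pairs
# share one public IP, distinguished only by the translated source port.
class PatTable:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.ports = itertools.count(20000)    # next free public port
        self.out = {}                          # (int_ip, int_port) -> pub_port

    def translate(self, internal_ip, internal_port):
        key = (internal_ip, internal_port)
        if key not in self.out:
            self.out[key] = next(self.ports)
        return (self.public_ip, self.out[key])

nat = PatTable("203.0.113.7")
print(nat.translate("10.0.0.5", 51000))  # ('203.0.113.7', 20000)
print(nat.translate("10.0.0.6", 51000))  # ('203.0.113.7', 20001)
print(nat.translate("10.0.0.5", 51000))  # ('203.0.113.7', 20000)  reused
```

The finite port counter is also why NAT port exhaustion happens: the table can hold at most ~65k concurrent mappings per public IP per protocol.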
Common Mistakes with NAT — Network Address Translation
- Assuming NAT provides security. NAT hides internal IPs but is not a firewall. It does not inspect or filter traffic.
- Running out of NAT ports. A single NAT device has 65,535 ports per protocol per public IP. High-connection services hit this limit.
- Forgetting NAT Gateway costs. An AWS NAT Gateway processing 1TB/month costs ~$90 — and that adds up across multiple AZs.
- Not understanding NAT type implications for real-time apps. Symmetric NAT makes WebRTC hole punching nearly impossible.
- Using a single NAT Gateway across multiple AZs. If that AZ goes down, all private subnets lose internet access.
Tools for NAT — Network Address Translation
- AWS NAT Gateway (Managed): Production-grade managed NAT with automatic scaling and HA within an AZ — Scale: Medium-Enterprise
- iptables / nftables (Open Source): Self-managed NAT on Linux — full control, no per-GB cost, but HA is on the operator — Scale: Small-Enterprise
- GCP Cloud NAT (Managed): Distributed NAT that scales per-VM without a single gateway instance — Scale: Medium-Enterprise
- fck-nat (Open Source): EC2-based NAT instance at 1/10th the cost of AWS NAT Gateway for dev/staging — Scale: Small-Enterprise
Related to NAT — Network Address Translation
IP Addressing & Subnetting, ARP & MAC Addresses, OSI Model — The Real Version, WebRTC — Peer-to-Peer, VPN & Tunneling, DHCP Protocol
Firewalls & Security Groups — Security & Encryption
Difficulty: Intermediate
Key Points for Firewalls & Security Groups
- Stateful firewalls (AWS Security Groups, iptables with conntrack) track connection state. Allowing inbound TCP/443 automatically permits the response packets. Stateless firewalls (NACLs) require explicit rules for both directions.
- iptables evaluates rules top-to-bottom in each chain. The first matching rule wins. A misplaced ACCEPT above a DROP renders the DROP unreachable — rule ordering is the most common source of firewall misconfigurations.
- AWS Security Groups are deny-by-default with allow-only rules. It is impossible to write a deny rule in a Security Group. To block specific IPs, use NACLs or WAF.
- Kubernetes pods are unrestricted by default — every pod can reach every other pod. The first NetworkPolicy applied to a pod activates filtering; from that point, only explicitly allowed traffic passes.
- Micro-segmentation — applying firewall rules per workload instead of per subnet — reduces the blast radius of a compromised host from the entire network to only the workloads it is explicitly allowed to reach.
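The first-match-wins behavior described above is easy to demonstrate with a toy rule chain. The rules and addresses are illustrative:

```python
# First-match-wins evaluation, as in an iptables chain: rule order decides
# the outcome. Rules are (predicate, action) pairs; the default policy
# applies when nothing matches.
def evaluate(rules, packet, default="DROP"):
    for pred, action in rules:
        if pred(packet):
            return action
    return default

rules = [
    (lambda p: p["port"] == 443, "ACCEPT"),
    (lambda p: p["src"] == "198.51.100.9", "DROP"),  # unreachable for :443!
]
print(evaluate(rules, {"src": "198.51.100.9", "port": 443}))  # ACCEPT
print(evaluate(rules, {"src": "198.51.100.9", "port": 22}))   # DROP
```

The blocked source still gets through on port 443 because the broad ACCEPT sits above the DROP, the exact misordering called out in the key points.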
Common Mistakes with Firewalls & Security Groups
- Opening a port to 0.0.0.0/0 in a Security Group for debugging and forgetting to remove it. This exposes the instance to the entire internet and is a leading cause of cloud breaches.
- Adding iptables rules to the wrong table. Filtering rules belong in the 'filter' table, not 'nat' or 'mangle.' Rules in the wrong table either have no effect or break NAT/routing.
- Assuming Kubernetes NetworkPolicy works without a supporting CNI. The default kubenet CNI does not enforce NetworkPolicy. Calico, Cilium, or a similar policy-aware CNI must be installed.
- Creating overlapping Security Group and NACL rules that conflict. Traffic must pass both — NACLs are evaluated first at the subnet level, then Security Groups at the instance level. A NACL deny overrides a Security Group allow.
- Not accounting for ephemeral ports in stateless NACLs. Outbound connections use random high ports (1024-65535). A NACL allowing only port 443 outbound blocks the return traffic from any connection initiated by the instance.
Tools for Firewalls & Security Groups
- iptables (Open Source): Traditional Linux packet filtering with mature tooling and documentation — still the default on most distributions, though being replaced by nftables — Scale: Small-Enterprise
- nftables (Open Source): Modern replacement for iptables with better performance, unified syntax for IPv4/IPv6/ARP, and native set/map support for efficient rule matching — Scale: Small-Enterprise
- AWS Security Groups (Managed): Stateful instance-level firewall integrated into AWS VPC — no agents to manage, automatic connection tracking, supports referencing other Security Groups as sources — Scale: Small-Enterprise
- Calico NetworkPolicy (Open Source): Kubernetes-native and extended network policy enforcement using iptables or eBPF — supports global policies, DNS-based rules, and application layer filtering — Scale: Medium-Enterprise
Related to Firewalls & Security Groups
Zero Trust Networking, DDoS & Rate Limiting, IP Addressing & Subnetting, Container Networking & Namespaces, eBPF for Networking
Network Latency — Where Time Goes — Performance & Observability
Difficulty: Intermediate
Key Points for Network Latency — Where Time Goes
- A cold HTTPS request from New York to London costs ~250ms minimum before a single byte of content arrives: DNS + TCP + TLS + TTFB
- Bandwidth and latency are fundamentally different — a 10 Gbps pipe doesn't help if RTT is 150ms. Latency is about distance; bandwidth is about width
- The speed of light in fiber is ~200,000 km/s (roughly 2/3 of vacuum speed), setting a hard physical floor on latency
- TLS 1.3 reduced the handshake from 2 RTTs to 1 RTT (and 0-RTT for resumption), which is why upgrading from TLS 1.2 matters
- Connection reuse (HTTP keep-alive, connection pooling) is the single most impactful latency optimization because it eliminates handshake costs entirely
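The ~250ms cold-request figure above falls out of simple arithmetic. A sketch with illustrative numbers (70ms RTT roughly matches New York–London over fiber; DNS and server think-time are assumptions):

```python
# Back-of-the-envelope latency budget for a cold HTTPS request.
RTT_MS = 70

def cold_request_ms(rtt, tls_rtts=1, dns_ms=30, ttfb_server_ms=20):
    tcp = rtt                    # TCP 3-way handshake: 1 RTT before any data
    tls = tls_rtts * rtt         # TLS 1.3 = 1 RTT; TLS 1.2 = 2 RTTs
    ttfb = rtt + ttfb_server_ms  # request out + server think time + first byte back
    return dns_ms + tcp + tls + ttfb

cold_tls13 = cold_request_ms(RTT_MS)              # ~260ms
cold_tls12 = cold_request_ms(RTT_MS, tls_rtts=2)  # one extra RTT: ~330ms
warm = RTT_MS + 20  # reused connection: no DNS, TCP, or TLS cost at all

print(cold_tls13, cold_tls12, warm)
```

The warm-connection number is why keep-alive and pooling dominate every other optimization on this list.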
Common Mistakes with Network Latency — Where Time Goes
- Optimizing bandwidth when latency is the bottleneck. A 1KB API response on a 100ms RTT link doesn't benefit from more bandwidth — the handshake overhead dominates
- Ignoring DNS resolution time. A cold DNS lookup to an authoritative server can add 50-200ms, and this happens before anything else
- Not enabling TLS 1.3. Sticking with TLS 1.2 adds an extra round-trip on every new connection — that's 50-150ms wasted per connection
- Measuring latency only from the data center. Real user latency includes last-mile ISP hops, which can add 10-50ms of jitter
- Assuming CDN solves everything. CDNs help with static content but dynamic API calls still hit origin servers — latency there is server think-time
Tools for Network Latency — Where Time Goes
- Chrome DevTools (Open Source): Waterfall breakdown of individual requests showing DNS, TCP, TLS, TTFB, and download phases — Scale: Development
- WebPageTest (Open Source): Multi-location testing with filmstrip view and connection-level timing from real browsers — Scale: Development-Production
- Lighthouse (Open Source): Automated performance auditing with actionable optimization suggestions — Scale: Development
- Catchpoint (Commercial): Synthetic monitoring from 800+ global locations with network-layer telemetry — Scale: Enterprise
Related to Network Latency — Where Time Goes
TCP Deep Dive, TLS Handshake — Step by Step, DNS Protocol Deep Dive, CDN & Edge Networking, Connection Pooling & Keep-Alive, HTTP/2 — Multiplexing Revolution, HTTP/3 — UDP Takes Over, Head-of-Line Blocking
Network Observability — Performance & Observability
Difficulty: Advanced
Key Points for Network Observability
- The four golden signals — latency, traffic, errors, saturation — are the minimum viable monitoring for any network. If only four things get tracked, make it these
- eBPF is the game-changer for network observability — it instruments the kernel without modifying code, adding latency, or requiring restarts
- Flow logs (NetFlow, sFlow, IPFIX) provide traffic-level visibility without packet capture overhead — essential for capacity planning and anomaly detection
- RED metrics (Rate, Errors, Duration) applied to network connections reveal issues that application-level metrics miss entirely
- Network observability is not network monitoring — monitoring tells the team something is broken, observability tells them WHY it broke
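A minimal sketch of computing RED metrics from per-request samples, aggregated by service so cardinality stays bounded (the sample tuples and field names are hypothetical):

```python
from collections import defaultdict

def red_metrics(samples, window_s):
    """samples: (service, duration_ms, ok) tuples observed in the window."""
    by_svc = defaultdict(lambda: {"count": 0, "errors": 0, "durations": []})
    for service, duration_ms, ok in samples:
        m = by_svc[service]
        m["count"] += 1
        m["errors"] += 0 if ok else 1
        m["durations"].append(duration_ms)
    return {
        svc: {
            "rate_rps": m["count"] / window_s,
            "error_ratio": m["errors"] / m["count"],
            # rough p50: middle element of the sorted durations
            "p50_ms": sorted(m["durations"])[len(m["durations"]) // 2],
        }
        for svc, m in by_svc.items()
    }

samples = [("checkout", 40, True), ("checkout", 55, True), ("checkout", 900, False)]
print(red_metrics(samples, window_s=60))
```

Real systems use histograms (e.g. Prometheus buckets) rather than storing raw durations, but the aggregation-by-service principle is the same.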
Common Mistakes with Network Observability
- Monitoring only at the application layer and missing network-level issues like packet loss, retransmissions, and routing changes that degrade performance silently
- Collecting too many metrics without aggregation. Per-connection metrics at high cardinality will overwhelm the monitoring system — aggregate by service, pod, or subnet
- Relying solely on SNMP polling at 5-minute intervals. Modern networks change in seconds — streaming telemetry is a must, not periodic polling
- Not correlating network metrics with application traces. A spike in TCP retransmissions might explain why API P99 latency jumped — but only if the data is overlaid
- Ignoring saturation metrics. CPU, memory, and bandwidth at 90% utilization don't trigger error metrics, but they cause tail latency spikes
Tools for Network Observability
- Cilium Hubble (Open Source): Kubernetes-native network observability using eBPF — service maps, flow visibility, and policy monitoring — Scale: Medium-Enterprise
- Prometheus + Grafana (Open Source): Metrics collection and visualization — pull-based model with PromQL for flexible querying and alerting — Scale: Any
- Grafana (Open Source): Unified dashboards combining network, infrastructure, and application metrics from multiple data sources — Scale: Any
- Datadog Network Monitoring (Managed): SaaS network performance monitoring with auto-discovery, flow maps, and DNS analytics across cloud and on-prem — Scale: Medium-Enterprise
Related to Network Observability
TCP Deep Dive, TCP Congestion Control, Network Latency — Where Time Goes, TCP/IP Debugging Toolkit, Service Mesh Networking, eBPF for Networking, Life of a Packet, Head-of-Line Blocking
OAuth 2.0 & OIDC Flows — Security & Encryption
Difficulty: Intermediate
Key Points for OAuth 2.0 & OIDC Flows
- OAuth 2.0 is an authorization framework, not an authentication protocol. OIDC adds the authentication layer on top.
- Authorization Code + PKCE is the recommended flow for all clients — SPAs, mobile apps, and server-side apps.
- The Implicit flow is deprecated because it exposes tokens in the URL fragment, vulnerable to history and referrer leaks.
- Client Credentials flow is for machine-to-machine communication — no user involved, the client authenticates with its own credentials.
- Refresh token rotation (issuing a new refresh token with each use) prevents stolen refresh tokens from being used indefinitely.
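The PKCE mechanism recommended above is small enough to sketch: the client derives a challenge from a random verifier, and the authorization server later checks that SHA-256 of the presented verifier matches (this follows the S256 method from RFC 7636; function names are hypothetical):

```python
import base64
import hashlib
import secrets

def make_pkce_pair():
    """Client side: random code_verifier and its S256 code_challenge."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

def server_verifies(verifier, challenge):
    """Server side: recompute the challenge at token exchange time."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode() == challenge

verifier, challenge = make_pkce_pair()
assert server_verifies(verifier, challenge)            # legitimate client
assert not server_verifies("stolen-guess", challenge)  # intercepted code alone is useless
```

An attacker who steals the authorization code still cannot redeem it without the verifier, which never leaves the client until the token exchange.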
Common Mistakes with OAuth 2.0 & OIDC Flows
- Using the Implicit flow for SPAs. It was deprecated in the OAuth 2.0 Security BCP. Use Authorization Code + PKCE instead.
- Storing tokens in localStorage where they are accessible to any JavaScript on the page via XSS attacks.
- Not validating JWT signatures on the resource server. Accepting any well-formed JWT without checking the signature is an open door.
- Using overly broad scopes. Tokens should request the minimum scopes needed — 'read:orders' not 'admin'.
- Treating the access token as an identity assertion. Access tokens prove authorization, not identity. Use the ID token for identity.
Tools for OAuth 2.0 & OIDC Flows
- Auth0 (Managed): Full-featured identity platform with Universal Login, MFA, and extensive SDKs — Scale: Small-Enterprise
- Keycloak (Open Source): Self-hosted identity provider with OIDC, SAML, LDAP federation, and fine-grained authorization — Scale: Small-Enterprise
- Okta (Commercial): Enterprise workforce identity with SSO, lifecycle management, and compliance certifications — Scale: Enterprise
- AWS Cognito (Managed): Serverless-friendly user pools with built-in hosted UI, integrated with API Gateway and ALB — Scale: Small-Enterprise
Related to OAuth 2.0 & OIDC Flows
TLS Handshake — Step by Step, Certificates & PKI, CORS — Cross-Origin Resource Sharing, REST vs GraphQL vs gRPC, HTTP/1.1 — The Foundation, API Gateway vs Load Balancer vs Reverse Proxy
OSI Model — The Real Version — Foundations & Data Travel
Difficulty: Beginner
Key Points for OSI Model — The Real Version
- The textbook 7-layer OSI model is a teaching tool. In practice, the TCP/IP 4-layer model is what runs the internet.
- Layer 2 (Link) handles a single network segment. Layer 3 (IP) handles routing across the internet.
- TCP and UDP are the only two transport protocols that matter for 99% of production systems.
- Most debugging starts at Layer 7 and works downward — but the best engineers know which layer to jump to.
- Encapsulation is the key concept: each layer adds a header, and the receiving side strips them in reverse order.
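The encapsulation idea can be sketched with toy string headers — these are stand-ins, not real frame formats:

```python
# Each layer prepends its header on send; the receiver strips them in
# reverse order on the way back up the stack.
def encapsulate(payload):
    segment = "[TCP]" + payload  # Layer 4: ports, sequence numbers
    packet = "[IP]" + segment    # Layer 3: source/destination addresses
    frame = "[ETH]" + packet     # Layer 2: MAC addresses
    return frame

def decapsulate(frame):
    packet = frame.removeprefix("[ETH]")
    segment = packet.removeprefix("[IP]")
    return segment.removeprefix("[TCP]")

wire = encapsulate("GET / HTTP/1.1")
assert wire == "[ETH][IP][TCP]GET / HTTP/1.1"
assert decapsulate(wire) == "GET / HTTP/1.1"
```

Each layer only ever looks at its own header, which is why a router (Layer 3) can forward packets without understanding TCP or HTTP at all.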
Common Mistakes with OSI Model — The Real Version
- Memorizing the 7-layer OSI model for interviews without understanding what each layer actually does in practice.
- Confusing Layer 4 (Transport) with Layer 7 (Application) when configuring load balancers, leading to wrong routing behavior.
- Assuming all network problems are application-layer issues. Sometimes the problem is MTU, ARP, or a routing table.
- Ignoring the Link layer entirely. MAC address conflicts and ARP issues can be incredibly hard to debug without understanding Layer 2.
- Treating layers as strict boundaries. In reality, protocols like QUIC deliberately blur layers for performance.
Tools for OSI Model — The Real Version
- Wireshark (Open Source): Deep packet inspection across all layers with a GUI — Scale: Small-Enterprise
- tcpdump (Open Source): CLI-based packet capture on servers, lightweight and scriptable — Scale: Small-Enterprise
- Netcat (nc) (Open Source): Quick TCP/UDP connectivity testing between hosts — Scale: Small-Enterprise
- Packet Sender (Open Source): Sending and receiving TCP/UDP/SSL packets with a simple UI — Scale: Small-Enterprise
Related to OSI Model — The Real Version
Life of a Packet, TCP Deep Dive, UDP — When Speed Beats Safety, ARP & MAC Addresses, Routing & BGP Basics, TCP/IP Debugging Toolkit
Proxy Protocols — Forward, Reverse & SOCKS — Foundations & Data Travel
Difficulty: Intermediate
Key Points for Proxy Protocols — Forward, Reverse & SOCKS
- A forward proxy acts on behalf of the client — the server sees the proxy's IP, not the client's. A reverse proxy acts on behalf of the server — the client sees the proxy's IP, not the server's.
- HTTP CONNECT creates an opaque TCP tunnel through a forward proxy. The proxy relays bytes without inspection, which is how HTTPS works through corporate proxies — the proxy never sees the encrypted content.
- SOCKS5 operates at Layer 4 (transport), making it protocol-agnostic. It can tunnel HTTP, SSH, database connections, or any TCP/UDP traffic — unlike HTTP proxies which only understand HTTP.
- The PROXY Protocol (v1 and v2) solves the client IP preservation problem for L4 load balancers. Without it, HAProxy, AWS NLB, and similar L4 proxies open a fresh TCP connection to the backend, so the backend sees the proxy's IP as the source address instead of the client's.
- Transparent proxies intercept traffic without client configuration — the client does not know it is being proxied. Explicit proxies require client configuration (browser proxy settings, HTTP_PROXY env var).
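The PROXY Protocol's v1 (human-readable text) variant is simple enough to sketch — one line carrying the original client address, prepended before the application bytes (the addresses here are example values):

```python
def build_proxy_v1(client_ip, client_port, proxy_ip, proxy_port):
    """Header format: PROXY TCP4 <src-ip> <dst-ip> <src-port> <dst-port>\\r\\n"""
    return f"PROXY TCP4 {client_ip} {proxy_ip} {client_port} {proxy_port}\r\n"

def parse_proxy_v1(data):
    """Backend side: split the PROXY header from the application payload."""
    header, _, rest = data.partition("\r\n")
    parts = header.split(" ")
    assert parts[0] == "PROXY"
    return {"client_ip": parts[2], "client_port": int(parts[4])}, rest

wire = build_proxy_v1("203.0.113.7", 54321, "10.0.0.5", 443) + "GET / HTTP/1.1\r\n"
meta, payload = parse_proxy_v1(wire)
assert meta["client_ip"] == "203.0.113.7"
assert payload.startswith("GET /")
```

This also shows why enabling PROXY Protocol on a listener that accepts direct connections breaks things: a client that never sends the header produces a parse failure on its first bytes.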
Common Mistakes with Proxy Protocols — Forward, Reverse & SOCKS
- Using a forward proxy to cache HTTPS traffic without understanding the implications. HTTPS through CONNECT creates an opaque tunnel — the proxy cannot cache what it cannot see without performing TLS interception (MITM).
- Forgetting to set X-Forwarded-For and X-Real-IP headers in reverse proxy configurations. Without these, the application sees every request as coming from the proxy's IP, breaking rate limiting, geo-location, and audit logging.
- Enabling PROXY Protocol on a listener that also accepts direct connections. PROXY Protocol prepends a binary or text header — clients connecting directly (without the header) get malformed request errors.
- Configuring SOCKS5 proxy without authentication in production. An open SOCKS proxy becomes a relay for spam, attacks, and data exfiltration — a liability that gets the proxy's IP blacklisted.
- Assuming a reverse proxy adds security by default. A misconfigured reverse proxy that passes through Host headers, allows request smuggling, or leaks internal paths can amplify vulnerabilities rather than mitigate them.
Tools for Proxy Protocols — Forward, Reverse & SOCKS
- Squid (Open Source): Forward proxy with caching, access control, and SSL bumping — the standard choice for corporate internet gateways and content filtering — Scale: Small-Enterprise
- NGINX (Open Source): Reverse proxy with load balancing, TLS termination, and caching — handles millions of concurrent connections with event-driven architecture — Scale: Small-Enterprise
- HAProxy (Open Source): High-performance L4/L7 load balancer with PROXY Protocol support — excels at TCP proxying where connection metadata preservation is critical — Scale: Medium-Enterprise
- mitmproxy (Open Source): Interactive HTTPS proxy for debugging and testing — performs TLS interception with a local CA to inspect encrypted traffic during development — Scale: Small
Related to Proxy Protocols — Forward, Reverse & SOCKS
API Gateway vs Load Balancer vs Reverse Proxy, TLS Handshake — Step by Step, NAT — Network Address Translation, HTTP/1.1 — The Foundation, Life of a Packet
QUIC Protocol — Transport & Reliability
Difficulty: Advanced
Key Points for QUIC Protocol
- QUIC reduces connection establishment from 2 RTTs (TCP handshake + TLS 1.3) — or 3 RTTs with TLS 1.2 — to 1 RTT for new connections and 0 RTT for repeat connections
- Stream-level multiplexing eliminates head-of-line blocking — the fundamental problem that HTTP/2 over TCP can never solve
- Connection migration via connection IDs means switching from WiFi to cellular doesn't drop the HTTP/3 connection
- Running over UDP means QUIC can be updated at the application layer without waiting for OS kernel updates to the TCP stack
- All QUIC packets (except the initial handshake) are encrypted, including headers — middleboxes cannot inspect or modify the transport layer
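Why connection IDs enable migration can be shown with two lookup tables — a simplified sketch, not either protocol's real demultiplexing code:

```python
# TCP identifies a connection by its 4-tuple, so a new client IP means a new
# connection; QUIC looks up the connection ID carried in each packet instead.
tcp_conns = {("198.51.100.4", 53000, "192.0.2.1", 443): "session-A"}
quic_conns = {"cid-7f3a": "session-A"}

def tcp_lookup(src_ip, src_port, dst_ip, dst_port):
    return tcp_conns.get((src_ip, src_port, dst_ip, dst_port))

def quic_lookup(connection_id):
    return quic_conns.get(connection_id)

# Client switches from WiFi to cellular: source IP and port change.
assert tcp_lookup("198.51.100.4", 53000, "192.0.2.1", 443) == "session-A"
assert tcp_lookup("203.0.113.9", 41000, "192.0.2.1", 443) is None  # TCP: connection lost
assert quic_lookup("cid-7f3a") == "session-A"                      # QUIC: same session
```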
Common Mistakes with QUIC Protocol
- Thinking QUIC is just 'TCP over UDP.' QUIC is a complete reimagining of transport with features TCP cannot provide (stream multiplexing, connection migration)
- Assuming 0-RTT is always safe. 0-RTT data is replayable — an attacker can capture and resend it. Only use 0-RTT for idempotent requests
- Blocking QUIC at the firewall and not knowing it. Many corporate firewalls block UDP 443, causing clients to silently fall back to TCP — check the metrics
- Ignoring UDP rate limiting on servers. Some cloud providers rate-limit UDP, which throttles QUIC before the protocol can optimize
- Not measuring QUIC vs TCP performance in the actual environment. QUIC wins big on lossy mobile networks but may show minimal improvement on low-latency data center links
Tools for QUIC Protocol
- quiche (Cloudflare) (Open Source): Rust-based QUIC implementation, used in Cloudflare's edge network — Scale: Large-Enterprise
- ngtcp2 (Open Source): C-based QUIC library, powers curl's HTTP/3 support — Scale: Any
- msquic (Microsoft) (Open Source): Cross-platform QUIC for Windows, Linux, macOS — used in Windows networking stack — Scale: Enterprise
- Google QUIC (gQUIC) (Open Source): Original QUIC implementation in Chromium, battle-tested at Google scale — Scale: Large-Enterprise
Related to QUIC Protocol
TCP Deep Dive, UDP — When Speed Beats Safety, TLS Handshake — Step by Step, HTTP/3 — UDP Takes Over, Head-of-Line Blocking, Connection Pooling & Keep-Alive, Network Latency — Where Time Goes, CDN & Edge Networking
REST vs GraphQL vs gRPC — Application Protocols
Difficulty: Intermediate
Key Points for REST vs GraphQL vs gRPC
- There is no universally best choice — the right answer depends on the client types, team size, performance needs, and caching requirements
- REST is the default choice for public APIs because every language, tool, and developer already knows HTTP
- GraphQL solves the over-fetching/under-fetching problem but introduces query complexity, N+1 issues, and caching challenges
- gRPC is 5-10x faster than JSON-based APIs but sacrifices human readability and browser compatibility
- Many production systems use all three — REST for public APIs, GraphQL for mobile/frontend BFFs, gRPC for internal services
Common Mistakes with REST vs GraphQL vs gRPC
- Choosing GraphQL because it's trendy without considering the operational complexity of query analysis and N+1 prevention
- Using REST for internal high-throughput service-to-service calls where gRPC would eliminate serialization overhead
- Building a GraphQL API without implementing query depth limiting and cost analysis — opening the system to denial-of-service
- Assuming gRPC replaces REST for public APIs — browser support requires gRPC-Web, which adds deployment complexity
- Over-engineering with multiple paradigms when a simple REST API with good pagination would suffice
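The query depth limiting mentioned above can be sketched crudely by counting brace nesting — real servers walk the parsed AST, but this approximates the idea:

```python
def query_depth(query):
    """Rough selection-set depth: maximum brace nesting in the raw query."""
    depth = max_depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth

MAX_DEPTH = 5

shallow = "{ user { name } }"
nested = "{ user { friends { friends { friends { friends { friends { name } } } } } } }"
assert query_depth(shallow) == 2
assert query_depth(nested) > MAX_DEPTH  # would be rejected before execution
```

A production gateway would also assign a cost to each field (list fields multiply cost) and reject queries over a budget, which catches wide-but-shallow abuse that depth limits miss.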
Tools for REST vs GraphQL vs gRPC
- Express / Fastify (REST) (Open Source): Building REST APIs in Node.js with minimal boilerplate and maximum ecosystem support — Scale: Startups to enterprise
- Apollo Server (GraphQL) (Open Source): Full-featured GraphQL server with federation, caching, and extensive plugin ecosystem — Scale: Medium to large frontend-driven applications
- gRPC-Go / gRPC-Java (Open Source): High-performance internal service communication with code generation from protobuf — Scale: Google-scale microservice architectures
- Hasura (Open Source): Instant GraphQL API over PostgreSQL with real-time subscriptions and authorization — Scale: Rapid prototyping to production
Related to REST vs GraphQL vs gRPC
gRPC & Protocol Buffers, HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, API Gateway vs Load Balancer vs Reverse Proxy, CORS — Cross-Origin Resource Sharing, Service Mesh Networking
Routing & BGP Basics — Foundations & Data Travel
Difficulty: Advanced
Key Points for Routing & BGP Basics
- BGP is the protocol that glues the entire internet together. Every ISP, cloud provider, and CDN uses it.
- BGP selects routes based on a priority chain: local preference → AS path length → origin type → MED → eBGP over iBGP → lowest router ID.
- A BGP misconfiguration can take down portions of the internet. Facebook's October 2021 outage was caused by a bad BGP withdrawal.
- Interior Gateway Protocols (OSPF, IS-IS) handle routing within an AS. BGP handles routing between ASes.
- BGP is a policy-based protocol. Unlike IGPs that find the shortest path, BGP lets operators express business relationships through routing policy.
Common Mistakes with Routing & BGP Basics
- Announcing prefixes the AS does not own. Without RPKI validation, anyone can claim any IP prefix — this is a BGP hijack.
- Not implementing maximum prefix limits on BGP sessions. A peer leaking a full table (900K+ routes) can overflow the router's memory.
- Ignoring BGP convergence time. After a failure, BGP can take 30-90 seconds to converge — an eternity for real-time traffic.
- Using BGP for internal routing when OSPF or IS-IS would be simpler and converge faster. BGP inside an AS adds unnecessary complexity.
- Not deploying RPKI/ROA to validate route origins. This is the single most impactful action for preventing route hijacking.
Tools for Routing & BGP Basics
- BIRD (Open Source): Full-featured BGP daemon used by major IXPs and hosting companies — Scale: Large-Enterprise
- FRRouting (FRR) (Open Source): Multi-protocol routing suite (BGP, OSPF, IS-IS) for Linux — successor to Quagga — Scale: Medium-Enterprise
- AWS Direct Connect (Managed): Private BGP peering with AWS over dedicated fiber, bypassing the public internet — Scale: Large-Enterprise
- Cloudflare Magic Transit (Managed): BGP-based DDoS protection — announce prefixes through Cloudflare's network — Scale: Large-Enterprise
Related to Routing & BGP Basics
OSI Model — The Real Version, IP Addressing & Subnetting, DNS Protocol Deep Dive, CDN & Edge Networking, Network Latency — Where Time Goes, Network Observability
Serialization & Wire Formats — Application Protocols
Difficulty: Intermediate
Key Points for Serialization & Wire Formats
- JSON is human-readable but verbose — field names repeat in every record. A 1KB JSON payload often shrinks to 300-400 bytes when encoded as Protobuf because Protobuf uses field numbers (1-2 bytes) instead of field name strings.
- Protobuf uses schema-on-write: the schema (.proto file) is compiled into the sender and receiver. Both sides must have compatible schemas. Avro uses schema-on-read: the writer's schema is included or referenced in the payload, so the reader can handle any version.
- Schema evolution is the real differentiator for long-lived systems. Protobuf allows adding optional fields and deprecating old ones as long as field numbers are never reused. Avro allows adding fields with defaults and renaming via aliases.
- MessagePack is 'binary JSON' — it maps directly to JSON types (map, array, string, int) but uses a compact binary encoding. It requires no schema and no code generation, making it a drop-in replacement for JSON with 30-50% smaller payloads.
- FlatBuffers and Cap'n Proto are zero-copy formats — the serialized bytes can be accessed directly without parsing into an intermediate object. This eliminates deserialization cost entirely, which matters for latency-critical paths.
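The field-number savings can be demonstrated with a toy tag-length-value encoding — illustrative only, not real Protobuf wire format:

```python
# A JSON record repeats every key as a string; a Protobuf-style encoding
# spends one tag byte per field instead.
import json
import struct

def encode_tlv(fields):
    """fields: list of (field_number, bytes_value) pairs."""
    out = b""
    for number, value in fields:
        out += struct.pack("BB", number, len(value)) + value  # tag, length, value
    return out

record = {"identifier": 12345, "display_name": "ada", "email_address": "ada@example.com"}
as_json = json.dumps(record).encode()
as_tlv = encode_tlv([
    (1, struct.pack(">I", 12345)),   # field 1: fixed 4-byte int
    (2, b"ada"),                     # field 2: string
    (3, b"ada@example.com"),         # field 3: string
])

print(len(as_json), len(as_tlv))  # the repeated field names dominate JSON's size
assert len(as_tlv) < len(as_json)
```

The cost is that the receiver needs the schema to know that field 1 means `identifier` — exactly the schema-distribution problem Protobuf and Avro solve in different ways.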
Common Mistakes with Serialization & Wire Formats
- Using JSON for high-throughput internal service communication. Parsing JSON at 100,000 messages per second consumes measurable CPU — Protobuf or Avro at the same throughput uses 3-5x less CPU for serialization/deserialization.
- Reusing Protobuf field numbers after deleting a field. If field 3 was a string and gets reassigned to an int, old clients reading new messages interpret the bytes as a string, causing silent data corruption.
- Choosing Avro without deploying a Schema Registry. Without the registry, writers embed the full schema in every message (or readers have no way to resolve the writer's schema), either bloating payloads or breaking deserialization.
- Assuming binary formats are always faster. For small payloads (< 100 bytes) with simple structure, JSON parsing and serialization with modern libraries (simdjson, orjson) can match or beat Protobuf due to lower fixed overhead and no code generation step.
- Not versioning schemas from day one. Retrofitting schema evolution into a system that started with unstructured JSON means migrating every producer and consumer simultaneously — a coordination nightmare that grows with the number of services.
Tools for Serialization & Wire Formats
- Protocol Buffers (Protobuf) (Open Source): Strongly typed RPC communication (gRPC) between microservices — excellent schema evolution, wide language support, and compact binary encoding — Scale: Medium-Enterprise
- Apache Avro (Open Source): Event streaming and data pipelines (Kafka, Hadoop) where schema-on-read flexibility and schema registry integration matter more than raw encoding speed — Scale: Medium-Enterprise
- MessagePack (Open Source): Drop-in binary replacement for JSON in APIs and caches — no schema required, smaller payloads, faster parsing, compatible with dynamic languages — Scale: Small-Enterprise
- FlatBuffers (Open Source): Zero-copy access for game engines, mobile apps, and latency-critical systems where deserialization cost must be eliminated entirely — Scale: Medium-Enterprise
Related to Serialization & Wire Formats
gRPC & Protocol Buffers, REST vs GraphQL vs gRPC, MQTT & IoT Protocols, HTTP/2 — Multiplexing Revolution, Network Latency — Where Time Goes
Server-Sent Events (SSE) — Real-Time & Streaming
Difficulty: Beginner
Key Points for Server-Sent Events (SSE)
- SSE uses plain HTTP — no protocol upgrade, no special handshake. Any HTTP server, proxy, or CDN can serve it without configuration changes.
- The EventSource API automatically reconnects on disconnect with exponential backoff, sending the Last-Event-ID header so the server can resume.
- SSE supports named event types, enabling multiplexed data streams (notifications, progress, updates) over a single connection.
- SSE is making a major comeback because of LLM streaming — ChatGPT, Claude, and most AI APIs stream token-by-token responses via SSE.
- Unlike WebSocket, SSE benefits from HTTP/2 multiplexing: multiple SSE streams share a single TCP connection instead of consuming one connection each, with no application-layer head-of-line blocking between streams.
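The text/event-stream wire format is plain text: each event is a few `field: value` lines terminated by a blank line. A sketch of a formatter (the helper name is hypothetical):

```python
def sse_event(data, event=None, event_id=None):
    """Format one SSE event; `id:` feeds Last-Event-ID on reconnect."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    for chunk in data.splitlines() or [""]:
        lines.append(f"data: {chunk}")  # multi-line data becomes multiple data: lines
    return "\n".join(lines) + "\n\n"    # blank line terminates the event

frame = sse_event("token", event="llm-chunk", event_id="42")
assert frame == "id: 42\nevent: llm-chunk\ndata: token\n\n"
```

On the client, `new EventSource(url)` plus an `addEventListener("llm-chunk", ...)` handler is all that is needed to consume this.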
Common Mistakes with Server-Sent Events (SSE)
- Using WebSocket when only server-to-client push is needed. SSE is simpler, auto-reconnects, and works through HTTP infrastructure natively.
- Forgetting to set Content-Type to text/event-stream. Without it, browsers will not parse the stream as events.
- Running SSE behind a buffering reverse proxy (like nginx with default settings) that holds the response until the connection closes instead of streaming chunks.
- Not implementing Last-Event-ID on the server side, causing clients to miss events after reconnection.
- Ignoring the 6-connection-per-domain limit in HTTP/1.1. With HTTP/2 this is not an issue, but on HTTP/1.1, each SSE stream consumes one of those precious slots.
Tools for Server-Sent Events (SSE)
- Native EventSource API (Open Source): Simple browser-native SSE consumption with zero dependencies — Scale: Small-Enterprise
- eventsource (npm polyfill) (Open Source): Node.js SSE client that also supports custom headers (auth tokens), which the native EventSource does not — Scale: Small-Enterprise
- Mercure (Open Source): SSE hub with pub/sub topics, JWT auth, and built-in reconnection handling — Scale: Medium-Enterprise
- Pushpin (Open Source): Reverse proxy that adds SSE and WebSocket push capabilities to any REST API — Scale: Medium-Enterprise
Related to Server-Sent Events (SSE)
HTTP/1.1 — The Foundation, HTTP/2 — Multiplexing Revolution, WebSocket Protocol, Long Polling vs SSE vs WebSocket, Connection Pooling & Keep-Alive, CORS — Cross-Origin Resource Sharing
Service Discovery & mDNS — Modern Patterns
Difficulty: Intermediate
Key Points for Service Discovery & mDNS
- Client-side discovery (the client queries a registry and picks an instance) gives maximum flexibility but pushes load balancing logic into every consumer
- Server-side discovery (a load balancer or DNS sits between client and registry) centralizes routing but adds a hop and a potential single point of failure
- mDNS uses multicast UDP on 224.0.0.251:5353 and the .local TLD — no infrastructure required, but limited to the local broadcast domain
- Kubernetes combines server-side discovery (ClusterIP Services resolved by CoreDNS) with client-side patterns (headless Services returning all pod IPs)
- Health checking is not optional — a registry full of dead instances is worse than no registry at all, because callers waste time connecting to corpses
Common Mistakes with Service Discovery & mDNS
- Registering a service instance without a health check. The instance crashes, the registry still routes traffic to it, and callers see connection refused for the full TTL.
- Using DNS-based discovery with high TTLs for rapidly scaling services. DNS caches stale records — a service that scaled from 3 to 30 instances still gets traffic to only the original 3.
- Confusing Kubernetes ClusterIP Services with headless Services. ClusterIP gives a single virtual IP (server-side discovery). Headless returns all pod IPs (client-side discovery). The load balancing behavior is fundamentally different.
- Running mDNS across subnets without a reflector. mDNS is multicast-scoped to the local link — it does not cross routers without an explicit mDNS reflector or gateway.
- Treating service discovery as fire-and-forget. Registry data goes stale when instances fail to deregister on shutdown. Implement graceful deregistration in the shutdown hook AND rely on TTL-based expiry as a safety net.
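The graceful-deregistration-plus-TTL pattern can be sketched as a tiny registry — a simplified model with hypothetical names, not Consul's or etcd's actual API:

```python
# Instances heartbeat to stay registered; missed heartbeats age them out
# even if they crashed without deregistering cleanly.
class Registry:
    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.instances = {}  # service -> {address: last_heartbeat_time}

    def heartbeat(self, service, address, now):
        self.instances.setdefault(service, {})[address] = now

    def deregister(self, service, address):
        """Graceful shutdown path: remove immediately, don't wait for TTL."""
        self.instances.get(service, {}).pop(address, None)

    def healthy(self, service, now):
        entries = self.instances.get(service, {})
        return [a for a, seen in entries.items() if now - seen <= self.ttl_s]

reg = Registry(ttl_s=30)
reg.heartbeat("orders", "10.0.0.1:8080", now=0)
reg.heartbeat("orders", "10.0.0.2:8080", now=0)
reg.heartbeat("orders", "10.0.0.1:8080", now=25)  # .1 keeps heartbeating
assert reg.healthy("orders", now=40) == ["10.0.0.1:8080"]  # .2 aged out via TTL
```

The shutdown hook calls `deregister` for fast removal; the TTL is the safety net for instances that die without running it.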
Tools for Service Discovery & mDNS
- Consul (Open Source): Multi-datacenter service discovery with built-in health checking, KV store, and service mesh (Connect) — Scale: Enterprise, multi-cloud
- CoreDNS (Open Source): Kubernetes-native DNS-based service discovery with a plugin architecture for extensibility — Scale: Cloud-native clusters
- etcd (Open Source): Distributed key-value store used as the backing store for Kubernetes and as a service registry for custom discovery — Scale: Cluster-level, strongly consistent
- ZooKeeper (Open Source): Mature coordination service with ephemeral nodes for automatic deregistration — battle-tested in Hadoop and Kafka ecosystems — Scale: Enterprise, Java-centric
Related to Service Discovery & mDNS
DNS Protocol Deep Dive, Service Mesh Networking, Container Networking & Namespaces, API Gateway vs Load Balancer vs Reverse Proxy, Connection Pooling & Keep-Alive
Service Mesh Networking — Modern Patterns
Difficulty: Advanced
Key Points for Service Mesh Networking
- The data plane (sidecar proxies) handles every packet, while the control plane tells those proxies what to do — separating concerns is the core design principle.
- mTLS is automatic in a service mesh — the control plane acts as a certificate authority, issuing short-lived certs and rotating them without application changes.
- Traffic shifting enables canary deployments by routing 1% of traffic to a new version, observing error rates, and gradually increasing — all via config, not code.
- Circuit breaking in the proxy prevents cascading failures by stopping requests to an unhealthy upstream once error thresholds are breached.
- Ambient mesh (Istio's sidecar-less mode) moves L4 functionality to a per-node ztunnel and L7 to shared waypoint proxies, reducing resource overhead by 50-90%.
Common Mistakes with Service Mesh Networking
- Deploying a service mesh before it is actually needed. With fewer than 10 services, the operational complexity likely outweighs the benefits.
- Not accounting for sidecar resource consumption — each Envoy sidecar uses 50-100MB RAM and adds 1-3ms p99 latency per hop.
- Assuming the mesh handles application-level retries correctly. If the app also retries, the result is retry amplification (3 app attempts x 3 mesh attempts = up to 9 calls).
- Ignoring sidecar injection failures. If a pod starts without its sidecar, it bypasses all mesh policies including mTLS, creating a security hole.
- Not setting proper timeout budgets. A 30s timeout on service A calling service B, with a 30s timeout on B calling C, means A could wait 60s+ across the chain.
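The retry-amplification and timeout-budget arithmetic above can be sketched directly (helper names are hypothetical):

```python
# Attempts multiply across layers, so retries configured in both the app
# and the mesh compound; deadlines should shrink down the chain.
def total_attempts(attempts_per_layer):
    """attempts_per_layer: max tries at each layer (1 = no retries)."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total

assert total_attempts([3, 3]) == 9  # 3 app tries x 3 mesh tries
assert total_attempts([1, 3]) == 3  # retries in the mesh only

def timeout_budget(parent_timeout_s, hop_overhead_s=0.05):
    """Each hop passes a smaller deadline downstream, never reuses its own."""
    return max(parent_timeout_s - hop_overhead_s, 0.0)

# A: 30s -> B: ~29.95s -> C: ~29.9s, so A never waits longer than its own 30s.
assert timeout_budget(timeout_budget(30)) < 30
```

gRPC deadlines and Envoy's per-try timeouts implement this propagation natively; the mistake is configuring independent 30s timeouts at every hop.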
Tools for Service Mesh Networking
- Istio (Open Source): Feature-complete mesh with advanced traffic management, security policies, and multi-cluster support — Scale: Medium-Enterprise
- Linkerd (Open Source): Lightweight, simple mesh with minimal resource overhead and fast startup — ideal for teams that want mTLS and observability without complexity — Scale: Small-Enterprise
- Consul Connect (Open Source): HashiCorp ecosystem integration with service discovery built-in, works across Kubernetes and VMs — Scale: Medium-Enterprise
- Cilium Service Mesh (Open Source): eBPF-powered mesh that avoids sidecars entirely for L3/L4, reducing latency and resource usage — Scale: Medium-Enterprise
Related to Service Mesh Networking
mTLS — Mutual Authentication, gRPC & Protocol Buffers, HTTP/2 — Multiplexing Revolution, Zero Trust Networking, eBPF for Networking, Network Observability
SMTP & Email Protocols — Application Protocols
Difficulty: Intermediate
Key Points for SMTP & Email Protocols
- Email delivery is a multi-hop process: sender client → sender MTA → DNS MX lookup → recipient MTA → recipient IMAP server → client
- SPF, DKIM, and DMARC work together — SPF validates the sending server, DKIM signs the message content, DMARC sets the policy
- SMTP uses a store-and-forward model — each server accepts the message and takes responsibility for delivery or bounce
- Email deliverability depends on IP reputation, authentication records, content quality, and recipient engagement
- IMAP keeps mail on the server and syncs state across devices; POP3 downloads and (optionally) deletes from server
Common Mistakes with SMTP & Email Protocols
- Not setting up SPF, DKIM, and DMARC records — without all three, email will land in spam
- Using a shared IP for transactional email — one bad neighbor's spam can tank the sender's IP reputation
- Sending from a new domain/IP without warming up — ISPs throttle unknown senders aggressively
- Not handling SMTP bounce codes correctly — soft bounces (4xx) should retry, hard bounces (5xx) should remove the address
- Assuming email delivery is instant — SMTP allows servers to queue and retry for up to 5 days
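The bounce-handling rule above reduces to a small classifier — a sketch, not an exhaustive treatment of enhanced status codes:

```python
def classify_bounce(smtp_code):
    """4xx = transient (retry with backoff); 5xx = permanent (suppress)."""
    if 400 <= smtp_code < 500:
        return "retry"     # soft bounce: mailbox full, greylisting, throttling
    if 500 <= smtp_code < 600:
        return "suppress"  # hard bounce: no such user, domain rejects mail
    return "delivered" if 200 <= smtp_code < 300 else "unknown"

assert classify_bounce(250) == "delivered"
assert classify_bounce(421) == "retry"     # service not available, try again later
assert classify_bounce(550) == "suppress"  # mailbox unavailable — stop sending
```

Continuing to send to hard-bounced addresses is one of the fastest ways to damage IP and domain reputation, which is why suppression lists matter as much as SPF/DKIM/DMARC.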
Tools for SMTP & Email Protocols
- Postfix (Open Source): Self-hosted MTA with excellent performance and security defaults — Scale: Handles millions of messages per day on modest hardware
- SendGrid (Managed): Transactional and marketing email with deliverability optimization and analytics — Scale: Sends 100+ billion emails per month across all customers
- AWS SES (Managed): Cost-effective email sending integrated with AWS infrastructure — Scale: Pay-per-email with dedicated IPs available
- Mailgun (Managed): Developer-focused email API with powerful log search and email validation — Scale: Startup to enterprise transactional email
Related to SMTP & Email Protocols
DNS Protocol Deep Dive, TLS Handshake — Step by Step, TCP Deep Dive, Certificates & PKI, Life of a Packet
Socket Programming Mental Model — Transport & Reliability
Difficulty: Advanced
Key Points for Socket Programming Mental Model
- A socket is just a file descriptor — read(), write(), and close() work on it like any other file. This is the Unix 'everything is a file' philosophy applied to networking
- The listen() backlog is NOT the max concurrent connections — it's the queue of connections that have completed the 3-way handshake but haven't been accept()ed yet
- accept() returns a BRAND NEW file descriptor for each client connection. The original listening socket stays open, ready for the next client
- Blocking I/O means one thread per connection, which doesn't scale past ~10K connections. Non-blocking I/O with epoll/kqueue handles millions
- The C10K problem (handling 10,000 concurrent connections) was solved by moving from thread-per-connection to event-driven I/O — this is how nginx, Node.js, and Go's runtime work
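The listen/accept mechanics described above show up in a few lines of stdlib code. A minimal loopback echo sketch, assuming nothing beyond Python's socket and threading modules:

```python
import socket
import threading

def serve_once(srv: socket.socket) -> None:
    conn, _addr = srv.accept()       # accept() returns a BRAND NEW fd for this client;
    with conn:                       # the listening socket stays open for the next one
        conn.sendall(conn.recv(1024))   # echo the bytes back

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # survive TIME_WAIT on restart
srv.bind(("127.0.0.1", 0))           # port 0: let the kernel pick a free port
srv.listen(128)                      # backlog: completed handshakes waiting to be accept()ed
port = srv.getsockname()[1]

t = threading.Thread(target=serve_once, args=(srv,))
t.start()

cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"ping")
echoed = cli.recv(1024)
print(echoed)                        # b'ping'
cli.close(); t.join(); srv.close()
```

This is the thread-per-connection model the last key point warns about; an epoll-based server would replace the blocking accept/recv with an event loop.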
Common Mistakes with Socket Programming Mental Model
- Forgetting SO_REUSEADDR when restarting a server. Without it, bind() fails with 'Address already in use' because the old socket is in TIME_WAIT
- Setting the listen backlog too small. Under burst traffic, new connections get dropped with TCP RST before accept() can process them
- Assuming one read() returns one complete message. TCP is a byte stream — a single read() may return half a message or three messages concatenated
- Blocking on accept() in a single-threaded server. While waiting for a new connection, existing clients can't be served — use I/O multiplexing
- Not handling EINTR (interrupted system call). Signals can interrupt any blocking syscall — always retry on EINTR
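The "one read() is not one message" mistake deserves a sketch: a framing loop that reads exactly n bytes no matter how TCP chunks them, demonstrated over a local socket pair:

```python
import socket

def recv_exactly(sock: socket.socket, n: int) -> bytes:
    """Loop until n bytes arrive: a single recv() may return a partial message."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

# The sender writes one 8-byte message in two pieces; the framing loop
# reassembles it regardless of how the bytes were chunked in transit.
left, right = socket.socketpair()
left.sendall(b"hell")
left.sendall(b"o...")
msg = recv_exactly(right, 8)
print(msg)   # b'hello...'
left.close(); right.close()
```

Real protocols prefix each message with its length (or use delimiters) so the receiver knows what n is; this loop is the building block either way.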
Tools for Socket Programming Mental Model
- epoll (Linux) (Open Source): High-performance I/O multiplexing on Linux — O(1) for ready events, handles millions of fds — Scale: Any
- kqueue (BSD/macOS) (Open Source): I/O multiplexing on FreeBSD and macOS with unified event notification for sockets, files, signals, and timers — Scale: Any
- io_uring (Linux 5.1+) (Open Source): Zero-copy, zero-syscall async I/O — the future of Linux networking for maximum throughput — Scale: Large-Enterprise
- libuv (Open Source): Cross-platform async I/O library — powers Node.js, uses epoll/kqueue/IOCP under the hood — Scale: Any
Related to Socket Programming Mental Model
TCP Deep Dive, UDP — When Speed Beats Safety, Connection Pooling & Keep-Alive, Life of a Packet, Head-of-Line Blocking, TCP/IP Debugging Toolkit, eBPF for Networking
TCP Congestion Control — Transport & Reliability
Difficulty: Advanced
Key Points for TCP Congestion Control
- Congestion control is about the NETWORK capacity, not the receiver's capacity — it prevents routers from dropping packets due to overloaded queues
- Without congestion control, TCP causes congestion collapse — in the October 1986 collapse, throughput between LBL and UC Berkeley dropped roughly a thousandfold before Jacobson's fixes
- CUBIC is the default on Linux since 2.6.19 — it uses a cubic function to probe bandwidth more aggressively than Reno on high-BDP links
- BBR (Bottleneck Bandwidth and RTT) fundamentally changed the game by modeling the network instead of reacting to loss
- Congestion control algorithms are NOT interchangeable — BBR and CUBIC competing on the same bottleneck can cause unfairness
Common Mistakes with TCP Congestion Control
- Confusing cwnd with rwnd. Flow control (rwnd) protects the receiver; congestion control (cwnd) protects the network. The sender uses min(cwnd, rwnd)
- Thinking slow start is slow. It doubles cwnd every RTT — a 10 MSS initial window reaches 10,240 segments in just 10 RTTs. It's exponential growth.
- Deploying BBR without understanding its fairness implications. BBR v1 is known to starve CUBIC flows sharing the same bottleneck
- Ignoring the initial congestion window. Linux now defaults to initcwnd=10; Google's research showed that raising it from the old default of 3 to 10 cut average page load times by about 10%
- Not monitoring congestion metrics. Optimization without measurement is guesswork — track retransmission rate, RTT variance, and cwnd over time
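The slow-start arithmetic from the second mistake is easy to verify: cwnd doubles every RTT, so growth is exponential. A quick sketch:

```python
def rtts_to_reach(target_segments: int, initcwnd: int = 10) -> int:
    """RTTs for slow start to grow cwnd from initcwnd to target, doubling each RTT."""
    cwnd, rtts = initcwnd, 0
    while cwnd < target_segments:
        cwnd, rtts = cwnd * 2, rtts + 1
    return rtts

print(rtts_to_reach(10_240))   # 10 -- a 10-MSS window reaches 10,240 segments in 10 RTTs
```

Real slow start exits earlier, at ssthresh or on loss, but the doubling is what makes "slow" start a misnomer.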
Tools for TCP Congestion Control
- CUBIC (Open Source): General-purpose default on Linux; good for most workloads without tuning — Scale: Any
- BBR (v2/v3) (Open Source): High-latency links, lossy networks (cellular, satellite), video streaming — Scale: Large-Enterprise
- Reno/NewReno (Open Source): Legacy systems, textbook reference implementation, low-BDP links — Scale: Any
- DCTCP (Open Source): Data center networks with ECN support — maintains ultra-low latency at high utilization — Scale: Enterprise Data Centers
Related to TCP Congestion Control
TCP Deep Dive, QUIC Protocol, Head-of-Line Blocking, Network Latency — Where Time Goes, CDN & Edge Networking, Life of a Packet, Network Observability
TCP Deep Dive — Transport & Reliability
Difficulty: Intermediate
Key Points for TCP Deep Dive
- TCP provides reliable, ordered, byte-stream delivery over an unreliable network — it is the workhorse of the internet
- The 3-way handshake costs one full RTT before any data flows, making connection setup the dominant cost for short-lived requests
- Flow control via the receive window prevents a fast sender from overwhelming a slow receiver — this is per-connection, not per-network
- Window scaling (RFC 7323) extends the 16-bit window field to support high-bandwidth, high-latency links like satellite or cross-continent
- TIME_WAIT exists for a reason: it prevents old duplicate segments from corrupting a new connection on the same port tuple
Common Mistakes with TCP Deep Dive
- Confusing flow control (receiver-driven, sliding window) with congestion control (network-driven, cwnd). They are independent mechanisms that both limit send rate
- Ignoring TIME_WAIT accumulation on busy servers. Thousands of sockets stuck in TIME_WAIT can exhaust ephemeral ports — tune net.ipv4.tcp_tw_reuse
- Disabling Nagle's algorithm blindly. Nagle reduces small-packet overhead; disable it only for latency-sensitive apps like gaming or real-time trading
- Not understanding delayed ACKs. The receiver waits up to 200ms hoping to piggyback the ACK on a data response — this interacts badly with Nagle
- Assuming TCP is 'fast enough' without measuring. A single TCP connection on a high-latency link will underperform due to the bandwidth-delay product
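Disabling Nagle, as the third mistake warns, is a deliberate per-socket decision, and it is a one-liner. A sketch of the setsockopt call:

```python
import socket

# TCP_NODELAY disables Nagle's algorithm for this socket only:
# right for latency-sensitive traffic, wasteful for bulk transfer.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(bool(nodelay))   # True
s.close()
```

Note the interaction mentioned above: with Nagle enabled, a small write can sit waiting for a delayed ACK from the peer, which is exactly the pathology latency-sensitive apps are avoiding.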
Tools for TCP Deep Dive
- tcpdump (Open Source): Packet-level capture and TCP flag inspection on the wire — Scale: Any
- Wireshark (Open Source): Visual TCP stream analysis, retransmission graphs, and expert info — Scale: Any
- ss (iproute2) (Open Source): Fast socket statistics — connection states, window sizes, RTT estimates — Scale: Any
- Packetbeat (Open Source): Real-time TCP flow monitoring integrated with Elasticsearch dashboards — Scale: Medium-Enterprise
Related to TCP Deep Dive
TCP Congestion Control, UDP — When Speed Beats Safety, QUIC Protocol, Connection Pooling & Keep-Alive, Socket Programming Mental Model, Life of a Packet, TLS Handshake — Step by Step, Head-of-Line Blocking
TCP/IP Debugging Toolkit — Performance & Observability
Difficulty: Intermediate
Key Points for TCP/IP Debugging Toolkit
- The best debugging approach is symptom-driven: start with what's broken (timeout, refused, slow, TLS error) and pick the right tool for that symptom
- tcpdump is the universal truth — when logs and metrics disagree, packets don't lie. Learn to capture and filter effectively
- ss -ti exposes TCP internals (RTT, cwnd, retransmits) per connection without packet capture — it's the fastest way to spot TCP issues
- mtr combines traceroute and ping into a continuous path analysis — it reveals which hop is dropping packets or adding latency
- Most 'network issues' are actually application issues. Always check the application layer (curl -v, HTTP status codes) before diving into packets
Common Mistakes with TCP/IP Debugging Toolkit
- Capturing too many packets without filters. Always use tcpdump with port and host filters — an unfiltered capture on a busy server fills disk in seconds
- Running traceroute once and drawing conclusions. Network paths fluctuate — use mtr with 100+ packets to get statistically meaningful results
- Confusing ICMP-based traceroute results with actual TCP path behavior. Some routers rate-limit ICMP, showing false packet loss
- Not checking both sides of the connection. A timeout might be the client not sending, the server not responding, or a middlebox dropping packets
- Forgetting about firewalls and security groups. 'Connection refused' vs 'connection timed out' indicates whether a firewall is dropping (timeout) or rejecting (refused)
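The refused-vs-timeout distinction in the last mistake maps directly onto exception types. A probe sketch (the return labels are illustrative, not from any tool):

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a connect attempt: 'refused' means something answered with RST;
    'timeout' means packets are being silently dropped (often a firewall)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "unreachable"

# Grab a free ephemeral port, release it, then probe it: nothing listens there,
# so the loopback stack answers with RST and we see 'refused'.
tmp = socket.socket(); tmp.bind(("127.0.0.1", 0))
free_port = tmp.getsockname()[1]; tmp.close()
result = probe("127.0.0.1", free_port)
print(result)   # refused
```

The same probe against a firewalled host would hang for the full timeout and return 'timeout', which is the tell that packets are being dropped rather than rejected.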
Tools for TCP/IP Debugging Toolkit
- Wireshark (Open Source): Deep packet inspection with GUI — TCP stream reassembly, retransmission analysis, protocol dissection — Scale: Development-Production
- tcpdump (Open Source): Command-line packet capture on remote servers — lightweight, available everywhere, scriptable — Scale: Any
- mtr (Open Source): Continuous network path analysis combining traceroute and ping — shows per-hop loss and jitter — Scale: Any
- netcat (nc) (Open Source): Quick connectivity tests — TCP/UDP port checks, simple client-server testing, banner grabbing — Scale: Any
Related to TCP/IP Debugging Toolkit
TCP Deep Dive, TCP Congestion Control, DNS Protocol Deep Dive, TLS Handshake — Step by Step, Life of a Packet, Network Latency — Where Time Goes, Network Observability, OSI Model — The Real Version
TCP vs UDP Decision Framework — Transport & Reliability
Difficulty: Beginner
Key Points for TCP vs UDP Decision Framework
- TCP guarantees ordered, reliable delivery at the cost of head-of-line blocking and connection setup latency — the right choice when every byte must arrive in order
- UDP provides minimal overhead and no head-of-line blocking but shifts reliability entirely to the application layer — the right choice when speed matters more than completeness
- QUIC combines the reliability of TCP with UDP's lack of head-of-line blocking by running independent streams over UDP with built-in TLS 1.3
- The real decision is not 'reliable vs fast' — it is about which guarantees the application actually needs and which it can handle itself
- DNS uses UDP because queries fit in a single packet and retrying is cheaper than maintaining a connection — but DNS-over-HTTPS typically runs over TCP via HTTP/2 (or back over UDP with HTTP/3)
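The "DNS fits in one packet" point is concrete: a hand-assembled A-record query is only a few dozen bytes. A sketch, not a full resolver:

```python
import struct

def build_dns_query(name: str) -> bytes:
    """Hand-assemble a DNS A-record query (wire format per RFC 1035)."""
    # Header: ID, flags (RD set), 1 question, 0 answer/authority/additional records
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed with its length, terminated by a zero byte
    qname = b"".join(bytes([len(label)]) + label.encode() for label in name.split("."))
    question = qname + b"\x00" + struct.pack(">HH", 1, 1)   # QTYPE=A, QCLASS=IN
    return header + question

pkt = build_dns_query("example.com")
print(len(pkt))   # 29 bytes -- far under the classic 512-byte UDP DNS limit
```

One sendto(), one recvfrom(), done; that is why UDP wins for DNS, and why TCP only enters the picture for large responses or DoH.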
Common Mistakes with TCP vs UDP Decision Framework
- Choosing TCP for real-time media because 'reliability is always better.' Retransmitting a dropped video frame that arrives after the playback deadline is worse than skipping it.
- Choosing UDP for bulk data transfer to 'go faster.' Without congestion control, UDP floods the network and causes massive packet loss for everyone.
- Assuming QUIC is always better than TCP. QUIC runs in userspace, consuming more CPU than kernel-optimized TCP — for simple request-response workloads, TCP is often faster.
- Not implementing any reliability on top of UDP. Games, VoIP, and video all need some form of selective acknowledgment and retransmission — raw UDP is rarely used directly.
- Ignoring SCTP, which provides message boundaries and multi-homing natively. It is the right choice for telephony signaling (SIGTRAN) and WebRTC data channels.
Tools for TCP vs UDP Decision Framework
- TCP (kernel) (Open Source): Web traffic, APIs, database connections, file transfer — any workload that needs guaranteed, ordered delivery with kernel-optimized performance — Scale: Universal
- QUIC (userspace) (Open Source): Web browsing (HTTP/3), mobile apps, and any workload suffering from TCP head-of-line blocking or frequent connection migration — Scale: Growing (40%+ of web traffic)
- KCP (Open Source): Low-latency reliable transport over UDP — popular in game networking and VPN tunnels where TCP retransmission is too slow — Scale: Niche
- ENet (Open Source): Game networking library providing reliable, unreliable, and sequenced channels over UDP with built-in fragmentation — Scale: Game development
Related to TCP vs UDP Decision Framework
TCP Deep Dive, UDP — When Speed Beats Safety, QUIC Protocol, Head-of-Line Blocking, TCP Congestion Control
TLS Handshake — Step by Step — Security & Encryption
Difficulty: Intermediate
Key Points for TLS Handshake — Step by Step
- TLS 1.3 reduced the handshake from 2 round trips to 1, cutting connection setup latency in half.
- Forward secrecy means a compromised server private key cannot decrypt past sessions — each session uses ephemeral keys.
- TLS 1.3 removed RSA key exchange entirely. Only ECDHE-based cipher suites are allowed.
- 0-RTT resumption in TLS 1.3 allows sending application data with the first flight, but is vulnerable to replay attacks.
- The cipher suite determines everything: key exchange algorithm, bulk encryption, and MAC — a bad choice means the connection is insecure.
Common Mistakes with TLS Handshake — Step by Step
- Still supporting TLS 1.0 or 1.1 in production. These are deprecated and have known vulnerabilities.
- Allowing CBC-mode cipher suites that are vulnerable to padding oracle attacks like POODLE and Lucky13.
- Not configuring forward secrecy. Using RSA key exchange means a stolen private key decrypts all historical traffic.
- Ignoring certificate chain errors during development and then shipping that code to production with verify disabled.
- Enabling 0-RTT resumption without understanding the replay attack surface — never use it for non-idempotent requests.
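Several of these mistakes disappear with sane client defaults. A sketch using Python's ssl module: refuse TLS 1.0/1.1 and keep certificate verification on (which create_default_context already enables):

```python
import ssl

ctx = ssl.create_default_context()             # CERT_REQUIRED + hostname checking by default
ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse deprecated TLS 1.0/1.1 outright
print(ctx.verify_mode == ssl.CERT_REQUIRED)    # True -- never ship with this disabled
```

The "verify disabled in dev" mistake above corresponds to setting check_hostname=False and verify_mode=CERT_NONE on this context; if that ever appears outside a test fixture, it is a bug.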
Tools for TLS Handshake — Step by Step
- OpenSSL (Open Source): Industry-standard TLS library with the most features and widest compatibility — Scale: Small-Enterprise
- BoringSSL (Open Source): Google's hardened fork optimized for Chrome and Android, smaller attack surface — Scale: Enterprise
- LibreSSL (Open Source): OpenBSD's security-focused fork with cleaner codebase, fewer CVEs — Scale: Small-Enterprise
- GnuTLS (Open Source): LGPL-licensed alternative when OpenSSL's license is incompatible — Scale: Small-Enterprise
Related to TLS Handshake — Step by Step
Certificates & PKI, mTLS — Mutual Authentication, HTTP/2 — Multiplexing Revolution, HTTP/3 — UDP Takes Over, QUIC Protocol, TCP Deep Dive
UDP — When Speed Beats Safety — Transport & Reliability
Difficulty: Beginner
Key Points for UDP — When Speed Beats Safety
- UDP has no connection setup — no handshake, no state to maintain. A single sendto() call puts a packet on the wire
- The 8-byte UDP header (vs TCP's 20-60 bytes) means less overhead per packet — critical for small, frequent messages like DNS queries
- UDP provides no ordering, no retransmission, no flow control, and no congestion control. The application handles all of this (or accepts the loss)
- UDP is the foundation for protocols that need speed over reliability: DNS, DHCP, NTP, gaming, VoIP, video streaming
- QUIC is proof that reliable, multiplexed transport can be built on top of UDP — doing so enables innovation at the application layer without waiting for OS kernel updates
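The "no handshake, no state" point is visible in code: one sendto() puts a datagram on the wire. A loopback sketch:

```python
import socket

# No connect(), no handshake, no stream: a single sendto() is the whole exchange.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(2.0)                   # never block forever on a lossy transport
port = rx.getsockname()[1]

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"ping", ("127.0.0.1", port))

data, addr = rx.recvfrom(1500)       # datagram boundaries are preserved, unlike TCP's byte stream
print(data)                          # b'ping'
tx.close(); rx.close()
```

On loopback this always arrives; across a real network the application owns retries, ordering, and rate limiting, exactly as the key points say.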
Common Mistakes with UDP — When Speed Beats Safety
- Saying 'UDP is unreliable, never use it.' UDP is unreliable by design — the question is whether the application needs reliability at the transport layer
- Sending UDP datagrams larger than the path MTU. This causes IP fragmentation, which is far worse than the original packet loss problem
- Not implementing application-level rate limiting. Without TCP's congestion control, UDP can flood the network and harm other traffic
- Assuming UDP datagrams arrive in order. Networks reorder packets — if order matters, the application must handle it
- Using UDP for large file transfers without building reliability on top. This inevitably leads to a poor reimplementation of TCP
Tools for UDP — When Speed Beats Safety
- iperf3 (Open Source): UDP throughput and jitter testing between two endpoints — Scale: Any
- tcpdump (Open Source): Capturing and analyzing UDP packets on the wire — Scale: Any
- netcat (nc) (Open Source): Quick UDP send/receive testing from the command line — Scale: Any
- Wireshark (Open Source): Deep protocol analysis of UDP-based protocols (DNS, QUIC, RTP) — Scale: Any
Related to UDP — When Speed Beats Safety
TCP Deep Dive, QUIC Protocol, DNS Protocol Deep Dive, WebRTC — Peer-to-Peer, MQTT & IoT Protocols, Head-of-Line Blocking, Network Latency — Where Time Goes
VPN & Tunneling — Security & Encryption
Difficulty: Intermediate
Key Points for VPN & Tunneling
- WireGuard has ~4,000 lines of code vs OpenVPN's ~100,000, making it dramatically easier to audit and less likely to have bugs.
- IPSec operates at the kernel level (L3) and is invisible to applications. OpenVPN runs in userspace over TCP or UDP (L4) and tunnels traffic through a TUN/TAP adapter.
- Split tunneling routes only private network traffic through the VPN, improving performance for internet-bound traffic.
- WireGuard uses the Noise protocol framework for key exchange, achieving a single round trip handshake.
- Site-to-site VPNs connect entire networks, while remote access VPNs connect individual devices to a network.
Common Mistakes with VPN & Tunneling
- Using PPTP in production. Its encryption (MS-CHAPv2) has been broken since 2012. Use WireGuard or IPSec IKEv2.
- Routing all traffic through the VPN (full tunnel) when only private network access is needed, creating a bottleneck.
- Not configuring DNS correctly for split tunnel — DNS queries leak to the ISP, revealing which internal services users access.
- Using pre-shared keys for IPSec instead of certificate-based authentication, making key rotation painful.
- Ignoring MTU issues. VPN encapsulation adds 40-80 bytes of overhead, which can cause silent packet drops if the inner MTU is not reduced.
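The MTU arithmetic from the last mistake, as a sketch. The 80-byte default here is the conservative WireGuard-over-IPv6 overhead, an assumption; measure your own stack:

```python
def inner_mtu(link_mtu: int = 1500, overhead: int = 80) -> int:
    """MTU to configure inside the tunnel after subtracting encapsulation overhead."""
    return link_mtu - overhead

print(inner_mtu())          # 1420 -- WireGuard's common default MTU
print(inner_mtu(1500, 40))  # 1460 -- at the low end of the 40-80 byte range
```

If the inner MTU is left at 1500, full-size packets get fragmented or silently dropped inside the tunnel, which surfaces as "small requests work, large ones hang".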
Tools for VPN & Tunneling
- WireGuard (Open Source): Modern VPN with minimal codebase, excellent performance, and simple configuration — Scale: Small-Enterprise
- OpenVPN (Open Source): Mature VPN with broad platform support, flexible authentication, and extensive plugin ecosystem — Scale: Small-Enterprise
- Tailscale (Managed): Zero-config mesh VPN built on WireGuard with identity-based access control and NAT traversal — Scale: Small-Enterprise
- AWS Site-to-Site VPN (Managed): IPSec VPN connecting on-premises networks to AWS VPCs with redundant tunnels — Scale: Enterprise
Related to VPN & Tunneling
TLS Handshake — Step by Step, IP Addressing & Subnetting, NAT — Network Address Translation, Routing & BGP Basics, Zero Trust Networking, Life of a Packet
WebRTC — Peer-to-Peer — Real-Time & Streaming
Difficulty: Advanced
Key Points for WebRTC — Peer-to-Peer
- WebRTC establishes direct peer-to-peer connections between browsers, bypassing the server for media delivery — reducing latency and server bandwidth costs.
- Signaling is not part of the WebRTC spec. The application must provide its own mechanism (WebSocket, HTTP polling, even copy-pasting SDP) to exchange connection metadata.
- ICE tries multiple connection paths simultaneously: host candidates (local IP), server-reflexive (STUN-discovered public IP), and relay (TURN). It picks the best one that works.
- About 80% of WebRTC connections succeed peer-to-peer via STUN. The remaining 20% — behind symmetric NATs or restrictive firewalls — need a TURN relay server.
- WebRTC encrypts everything by default. DTLS secures the key exchange, SRTP encrypts media, and there is no option to disable encryption — it is mandatory in the spec.
Common Mistakes with WebRTC — Peer-to-Peer
- Forgetting to deploy a TURN server. STUN alone fails for ~20% of users behind symmetric NATs. Without TURN, those users simply cannot connect.
- Using a public TURN server in production. TURN relays significant bandwidth — this demands dedicated infrastructure or a paid service with capacity planning.
- Assuming WebRTC scales like a regular server. Each peer-to-peer connection is point-to-point. A 10-person call requires 9 connections per peer (full mesh), which destroys bandwidth.
- Not implementing a Selective Forwarding Unit (SFU) for group calls. Beyond 3-4 participants, full mesh is impractical — a media server is required.
- Ignoring ICE restart. When a user switches from WiFi to cellular, the ICE candidates change. Without ICE restart, the call drops.
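The full-mesh scaling problem above is just arithmetic: each peer uploads to every other peer. A quick sketch:

```python
def mesh_connections(n: int) -> tuple:
    """Full-mesh cost: (connections per peer, total pairwise connections) for n participants."""
    return n - 1, n * (n - 1) // 2

per_peer, total = mesh_connections(10)
print(per_peer, total)   # 9 uplinks per peer, 45 pairwise connections
```

With each peer encoding and uploading 9 video streams, a 10-person mesh call is hopeless on consumer uplinks; an SFU reduces each peer to one uplink regardless of n.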
Tools for WebRTC — Peer-to-Peer
- Twilio (Managed): Production-ready video/voice APIs with global TURN infrastructure and recording — Scale: Small-Enterprise
- LiveKit (Open Source): Open-source SFU with room-based video conferencing and well-maintained client SDKs — Scale: Medium-Enterprise
- Janus (Open Source): Lightweight, plugin-based WebRTC gateway for custom media routing pipelines — Scale: Medium-Large
- mediasoup (Open Source): Node.js-based SFU library for building custom video conferencing architectures — Scale: Medium-Enterprise
Related to WebRTC — Peer-to-Peer
UDP — When Speed Beats Safety, NAT — Network Address Translation, TLS Handshake — Step by Step, WebSocket Protocol, Long Polling vs SSE vs WebSocket, QUIC Protocol
WebSocket Protocol — Application Protocols
Difficulty: Intermediate
Key Points for WebSocket Protocol
- WebSocket provides true full-duplex communication — both client and server can send messages independently at any time
- The protocol starts as HTTP and upgrades, making it firewall-friendly and compatible with existing infrastructure
- Client-to-server frames MUST be masked (XOR with a random key) to prevent cache poisoning attacks on proxies
- WebSocket has no built-in reconnection — the application must implement retry logic, exponential backoff, and state reconciliation
- A single WebSocket connection can carry thousands of messages per second with minimal overhead (2-14 bytes per frame)
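The client-to-server masking rule above is a simple XOR with a 4-byte key (RFC 6455); because XOR is its own inverse, the same function masks and unmasks. A sketch:

```python
def mask(payload: bytes, key: bytes) -> bytes:
    """Apply the RFC 6455 client-to-server mask: XOR each byte with the 4-byte key."""
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))

key = b"\x12\x34\x56\x78"   # real frames use a fresh random key per frame
masked = mask(b"hello", key)
print(masked != b"hello")   # True: the bytes on the wire look nothing like the payload
print(mask(masked, key))    # b'hello' -- XOR is its own inverse
```

The randomness of the key is what defeats cache-poisoning attacks: a malicious page cannot make masked frames look like a crafted HTTP request to an intermediary proxy.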
Common Mistakes with WebSocket Protocol
- Not implementing heartbeat/ping-pong — without it, dead connections go undetected for hours
- Assuming WebSocket connections survive network changes — they don't, unlike QUIC/HTTP/3
- Sending JSON when binary protobuf would halve the bandwidth — WebSocket supports both text and binary frames
- Not handling reconnection logic — the protocol has no auto-reconnect, the application must build it
- Running WebSocket behind a load balancer without sticky sessions — connections can't be seamlessly moved between servers
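Since the protocol has no auto-reconnect, the application supplies the retry schedule. A sketch of exponential backoff with a ceiling (jitter omitted for clarity; production code should add it to avoid thundering herds):

```python
def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """Reconnect delays in seconds: exponential growth, clamped at cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]

print(backoff_schedule(8))   # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

On each successful reconnect the client also needs state reconciliation, typically resubscribing to channels and requesting messages missed since the last acknowledged sequence number.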
Tools for WebSocket Protocol
- Socket.IO (Open Source): WebSocket with automatic fallback to long-polling, rooms, and namespaces — Scale: Small to medium real-time apps
- ws (Node.js) (Open Source): Lightweight, spec-compliant WebSocket implementation with no abstractions — Scale: High-performance Node.js servers
- Gorilla WebSocket (Open Source): Production Go WebSocket server with compression and connection management — Scale: High-concurrency Go services
- SignalR (Open Source): .NET real-time framework with automatic transport negotiation and hub abstraction — Scale: Enterprise .NET applications
Related to WebSocket Protocol
Long Polling vs SSE vs WebSocket, Server-Sent Events (SSE), HTTP/1.1 — The Foundation, TCP Deep Dive, Connection Pooling & Keep-Alive, TLS Handshake — Step by Step, CORS — Cross-Origin Resource Sharing
Zero Trust Networking — Modern Patterns
Difficulty: Advanced
Key Points for Zero Trust Networking
- Zero trust eliminates the concept of a trusted internal network — every request is authenticated and authorized regardless of network location.
- Google's BeyondCorp proved the model at scale: 100,000+ employees access internal tools through the same path as external users, with no VPN needed.
- Identity replaces IP addresses as the security primitive — policies say 'service A can call service B' not 'allow 10.0.1.0/24 to 10.0.2.0/24'.
- Continuous verification means authentication is not just at login — every request is re-evaluated against current risk signals, session state, and device posture.
- Micro-segmentation limits blast radius: if an attacker compromises one service, they cannot move laterally because every other service requires independent authorization.
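The "identity replaces IP addresses" point can be made concrete with a default-deny policy table keyed on service identities. The service names and policy entries are purely illustrative, not from any real product:

```python
# Identity-based allow-list: policies name services, not IP ranges.
POLICY = {
    ("checkout", "payments"): True,
    ("checkout", "inventory"): True,
}

def authorize(caller: str, callee: str) -> bool:
    """Default-deny: a call is allowed only if an explicit policy grants it."""
    return POLICY.get((caller, callee), False)

print(authorize("checkout", "payments"))   # True
print(authorize("payments", "checkout"))   # False -- no lateral movement by default
```

In practice the caller identity comes from an mTLS certificate or signed token, and the policy engine also weighs device posture and risk signals, but default-deny on identities is the core of the model.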
Common Mistakes with Zero Trust Networking
- Treating zero trust as a product that can be bought rather than an architecture to implement. No single vendor delivers complete zero trust.
- Implementing identity verification at the perimeter but still trusting all traffic inside the network — this is just a fancy VPN, not zero trust.
- Not including machine-to-machine (service-to-service) traffic in the zero trust model. If only user-facing requests go through the policy engine, east-west traffic is unprotected.
- Overly permissive policies that effectively allow everything, making the zero trust layer a performance tax with no security benefit.
- Ignoring device posture checks — authenticating the user is not enough if their unpatched laptop is compromised and exfiltrating data.
Tools for Zero Trust Networking
- Cloudflare Access (Managed): Fastest path to zero trust for web applications — identity-aware proxy with no infrastructure to manage — Scale: Small-Enterprise
- Zscaler (Commercial): Enterprise-grade zero trust network access (ZTNA) replacing VPNs, with DLP and threat inspection — Scale: Enterprise
- Google BeyondCorp Enterprise (Managed): Google-native zero trust with Chrome integration, DLP, and threat protection for Google Workspace customers — Scale: Enterprise
- Tailscale (Managed): WireGuard-based mesh VPN with identity-aware ACLs — simplest path to zero trust for internal tools and SSH — Scale: Small-Large
Related to Zero Trust Networking
mTLS — Mutual Authentication, TLS Handshake — Step by Step, OAuth 2.0 & OIDC Flows, Certificates & PKI, Service Mesh Networking, VPN & Tunneling