Alerting & On-Call — Observability & Reliability
Difficulty: Intermediate
Key Points for Alerting & On-Call
- Every alert that fires should require a human to do something. If it doesn't, delete it.
- Alert on SLO burn rate, not raw metrics. 'Burned 10% of error budget in 1 hour' beats 'error rate > 1%' every time.
- Escalation policies route alerts to the right person: primary, then secondary, then manager, then incident commander.
- Runbooks cut MTTR by giving the on-call engineer an actual playbook instead of guesswork at 3 AM.
- If the on-call gets more than 2 pages per shift, the system needs fixing. More heroics won't help.
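The burn-rate idea above can be sketched numerically. This is a minimal illustration, not any vendor's API; the function names and the 14.4x threshold follow the common multiwindow convention (burning ~2% of a 30-day budget in one hour), but the exact numbers are a policy choice:

```python
# Burn rate = how fast the error budget is being consumed.
# A burn rate of 1.0 exhausts exactly the whole budget over the SLO window.

def burn_rate(error_rate: float, slo: float) -> float:
    """Ratio of observed error rate to the budgeted error rate (1 - slo)."""
    budget = 1.0 - slo
    return error_rate / budget

# Multiwindow-style paging rule (illustrative threshold): page if a 1h
# window burns the budget at >= 14.4x the sustainable rate.
def should_page(error_rate_1h: float, slo: float, threshold: float = 14.4) -> bool:
    return burn_rate(error_rate_1h, slo) >= threshold
```

With a 99.9% SLO, a 1.44% error rate over an hour is a burn rate of 14.4, which pages; a 0.1% error rate is a burn rate of 1.0, which does not.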
Common Mistakes with Alerting & On-Call
- Alert fatigue. Too many non-actionable alerts train responders to ignore everything, including the real incidents.
- No deduplication or grouping. One cascading database failure should not produce 100 separate notifications.
- Missing runbooks. The alert fires at 3 AM and the on-call engineer has zero context on what to do next.
- Not tracking the alert-to-incident ratio. A high ratio means the team is drowning in false positives.
- Alerting on error rate without considering traffic volume. At low volume, one error in two requests is a 50% error rate, and it means nothing.
Tools for Alerting & On-Call
- PagerDuty (Commercial): Incident management, escalation policies, integrations — Scale: Medium-Enterprise
- OpsGenie (Commercial): Atlassian integration, alert routing, on-call scheduling — Scale: Medium-Enterprise
- Prometheus Alertmanager (Open Source): Prometheus-native, grouping, silencing, inhibition — Scale: Medium-Enterprise
- Grafana OnCall (Open Source): Grafana-native, ChatOps, escalation chains — Scale: Small-Large
Related to Alerting & On-Call
Metrics & Monitoring, Distributed Logging, Distributed Tracing
API Gateway — Networking & Traffic
Difficulty: Intermediate
Key Points for API Gateway
- Single entry point for all client requests, centralizing cross-cutting concerns like auth and rate limiting
- Handles authentication, rate limiting, routing, and protocol translation in one place
- Can aggregate multiple microservice calls into a single client response
- Sits on the critical path. Must be highly available and low-latency or everything suffers
- Decouples client interface from internal service topology
Common Mistakes with API Gateway
- Single point of failure without redundancy. Always deploy at least two instances behind a load balancer
- Putting business logic in the gateway layer. Keep it thin: route and validate, nothing else
- Not implementing circuit breakers for downstream service failures
- Ignoring tail latency. The gateway sits on every request path, so its P99 overhead compounds with every backend's P99
- Skipping request/response transformation versioning, which breaks clients on deploy
Tools for API Gateway
- Kong (Open Source): Plugin ecosystem, Lua extensibility — Scale: Medium-Enterprise
- AWS API Gateway (Managed): Serverless, Lambda integration — Scale: Small-Enterprise
- Envoy (Open Source): Service mesh sidecar, gRPC-native — Scale: Large-Enterprise
- NGINX (Open Source): High-performance reverse proxy — Scale: Small-Enterprise
Related to API Gateway
Load Balancer, WebSocket Gateway, Service Mesh, Rate Limiting & Throttling, DNS & Service Discovery
Artifact Management & Container Registry — CI/CD & Deployment
Difficulty: Intermediate
Key Points for Artifact Management & Container Registry
- Central repository for build artifacts (container images, packages, binaries) with versioning and access control
- Container registries store OCI images with layer deduplication. Only changed layers get pushed or pulled.
- Image scanning detects CVEs in base images and dependencies before deployment
- Immutable references (pin SHA256 digests, not mutable tags like :latest) ensure reproducible deployments
- Proximity matters. A registry in the same region as the cluster cuts pull times dramatically.
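The digest-pinning point can be enforced mechanically. A sketch of a policy check (the regex and function are illustrative, not a registry API):

```python
import re

# Accept only digest-pinned references (repo@sha256:<64 hex>);
# reject mutable tags like :latest, which cannot be reproduced later.
DIGEST_RE = re.compile(r"^[\w./-]+@sha256:[0-9a-f]{64}$")

def is_reproducible(image_ref: str) -> bool:
    return bool(DIGEST_RE.match(image_ref))
```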
Common Mistakes with Artifact Management & Container Registry
- Using :latest in production. There is no way to deterministically reproduce a deployment.
- Not scanning images before deployment. Known CVEs walk straight into production.
- Storing secrets in image layers. They persist in layer history forever.
- No image retention policy. Registry storage grows unbounded and costs creep up.
- Pulling from public registries in CI. Rate limits and outages will break the pipeline.
Tools for Artifact Management & Container Registry
- AWS ECR (Managed): EKS integration, lifecycle policies, scanning — Scale: Small-Enterprise
- Harbor (Open Source): Enterprise registry, replication, RBAC, scanning — Scale: Medium-Enterprise
- Docker Hub (Managed): Public images, community ecosystem — Scale: Small-Medium
- GitHub Packages (Managed): GitHub-native, multi-format (npm, Docker, Maven) — Scale: Small-Large
Related to Artifact Management & Container Registry
CI/CD Pipeline Design, Container Runtime & Docker, Secrets Management, Kubernetes Architecture
Auto-Scaling Patterns — Compute & Orchestration
Difficulty: Advanced
Key Points for Auto-Scaling Patterns
- Dynamically adjusts compute capacity based on demand. Scales out under load, scales in when idle.
- Three main approaches: reactive scaling (metric thresholds), predictive scaling (ML-based forecasting), and scheduled scaling
- Horizontal Pod Autoscaler (HPA) scales pods. Cluster Autoscaler scales nodes. Karpenter can replace both.
- Scale-out should be fast. Scale-in needs to be conservative, or oscillation occurs.
- Custom metrics like queue depth and business KPIs usually make better scaling signals than CPU or memory
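The reactive approach can be illustrated with the core formula the Kubernetes HPA documents: desired = ceil(currentReplicas * currentMetric / targetMetric). This sketch omits the HPA's stabilization windows and tolerance band, so treat it as the idea, not the implementation:

```python
from math import ceil

# Kubernetes HPA core scaling formula, with min/max clamping.
def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_r: int = 1, max_r: int = 100) -> int:
    desired = ceil(current_replicas * current_metric / target_metric)
    return max(min_r, min(max_r, desired))
```

For example, 4 replicas at 200% of the target metric scales to 8; at 50% of target it scales in to 2.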
Common Mistakes with Auto-Scaling Patterns
- Scaling on CPU only. High CPU doesn't always mean the service needs more instances.
- Setting scale-in cooldown too short. This causes thrashing: scale out, scale in, scale out, repeat.
- Not accounting for pod startup time. New pods aren't ready for 30-60 seconds after creation.
- Ignoring cluster autoscaler lag. Node provisioning takes 2-5 minutes on most cloud providers.
- Skipping PodDisruptionBudgets during scale-in. Terminating too many pods at once kills availability.
Tools for Auto-Scaling Patterns
- Kubernetes HPA (Open Source): Pod-level scaling, custom metrics API — Scale: Medium-Enterprise
- Karpenter (Open Source): Fast node provisioning, instance type selection — Scale: Medium-Enterprise
- AWS Auto Scaling (Managed): EC2/ECS scaling, target tracking policies — Scale: Small-Enterprise
- KEDA (Open Source): Event-driven scaling, scale-to-zero — Scale: Medium-Enterprise
Related to Auto-Scaling Patterns
Kubernetes Architecture, Metrics & Monitoring, Load Balancer, Serverless & FaaS
Caching Strategies — Data & Storage
Difficulty: Intermediate
Key Points for Caching Strategies
- Cuts database load and latency by serving hot data from in-memory stores instead of disk
- Cache-aside, read-through, write-through, write-behind. Each pattern fits different read/write ratios
- Cache invalidation is genuinely hard. TTL, event-driven invalidation, and versioned keys are the main tools
- Cache stampede (thundering herd) hits when many requests miss at the same time. Use locking or stale-while-revalidate
- Multi-tier caching (L1 in-process, L2 distributed) trades off latency against consistency
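Cache-aside plus a stampede guard can be sketched together. This is an in-process illustration (the real pattern usually fronts Redis or similar); on a miss, one caller recomputes while concurrent callers serve stale data if any exists:

```python
import threading
import time

_cache: dict[str, tuple[float, object]] = {}   # key -> (expires_at, value)
_locks: dict[str, threading.Lock] = {}

def get(key: str, loader, ttl: float = 60.0, now=time.monotonic):
    entry = _cache.get(key)
    if entry and entry[0] > now():
        return entry[1]                          # fresh hit
    lock = _locks.setdefault(key, threading.Lock())
    if lock.acquire(blocking=False):             # only one caller recomputes
        try:
            value = loader(key)
            _cache[key] = (now() + ttl, value)
            return value
        finally:
            lock.release()
    if entry:
        return entry[1]                          # others serve stale-while-revalidate
    with lock:                                   # no stale copy: wait for the loader
        entry = _cache.get(key)
        if entry and entry[0] > now():
            return entry[1]
        value = loader(key)
        _cache[key] = (now() + ttl, value)
        return value
```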
Common Mistakes with Caching Strategies
- Caching without setting TTL. Stale data sticks around forever
- Treating cache as a primary data store. Eviction means data loss
- Not handling cache failures. If Redis goes down, the app should fall back to the database, not crash
- Sloppy key design. Poorly namespaced keys cause silent data overwrites
- Skipping cache warming after deploys. A cold cache on restart hammers the database
Tools for Caching Strategies
- Redis (Open Source): Rich data structures, pub/sub, persistence options — Scale: Small-Enterprise
- Memcached (Open Source): Simple key-value, multi-threaded, maximum throughput — Scale: Medium-Enterprise
- Caffeine (Open Source): JVM in-process cache, near-optimal hit ratio — Scale: Small-Large
- Hazelcast (Open Source): Distributed cache, embedded or client-server — Scale: Medium-Enterprise
Related to Caching Strategies
CDN & Edge Computing, Database Sharding, Rate Limiting & Throttling
CDN & Edge Computing — Networking & Traffic
Difficulty: Intermediate
Key Points for CDN & Edge Computing
- Caches content at geographically distributed edge nodes close to users
- Cuts origin server load and drops latency by 50-90% for static assets
- Edge computing goes beyond caching. Run actual logic at the edge with Workers or Lambda@Edge
- Cache invalidation is the genuinely hard part. TTL, purge APIs, stale-while-revalidate all have tradeoffs
- Shield/origin-shield pattern prevents thundering herd on cache misses
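The header-varying mistake below comes down to cache key construction. A sketch of the idea (field handling is simplified; real CDNs normalize and canonicalize far more aggressively):

```python
# Build a cache key that varies on the headers the response declares in
# Vary, so a gzip body is never served to a client that can't decode it.
def cache_key(url: str, request_headers: dict[str, str], vary: list[str]) -> str:
    req = {k.lower(): v for k, v in request_headers.items()}
    parts = [url] + [f"{name}={req.get(name, '')}"
                     for name in sorted(h.lower() for h in vary)]
    return "|".join(parts)
```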
Common Mistakes with CDN & Edge Computing
- Setting overly long TTLs without a purge strategy. Stale content ends up stuck globally
- Caching responses with Set-Cookie headers, which serves one user's session to another
- Not varying cache keys on relevant headers (Accept-Encoding, Accept-Language)
- Ignoring cache hit ratio metrics. A low hit ratio means the CDN is just adding latency for no benefit
- Skipping origin shield. Without it, N edge PoPs each independently hammer the origin on a miss
Tools for CDN & Edge Computing
- CloudFront (Managed): AWS ecosystem, Lambda@Edge — Scale: Small-Enterprise
- Cloudflare (Managed): Edge Workers, DDoS protection, massive PoP network — Scale: Small-Enterprise
- Fastly (Managed): Instant purge, VCL customization, real-time logging — Scale: Medium-Enterprise
- Akamai (Commercial): Largest network, media delivery, enterprise SLAs — Scale: Enterprise
Related to CDN & Edge Computing
Load Balancer, Caching Strategies, DNS & Service Discovery
CI/CD Pipeline Design — CI/CD & Deployment
Difficulty: Intermediate
Key Points for CI/CD Pipeline Design
- Continuous Integration merges code frequently; Continuous Delivery automates the release process; Continuous Deployment pushes every passing build to production automatically
- Pipeline stages: lint, test, build, security scan, deploy to staging, integration test, deploy to prod
- Fast feedback loops matter more than anything else. Aim for under 10 min from commit to test results.
- Trunk-based development with short-lived feature branches keeps merge conflicts small and manageable
- Pipeline as code (Jenkinsfile, .github/workflows) makes builds reproducible and auditable
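The pipeline-as-code point can be sketched in GitHub Actions syntax. Job names and `make` targets are placeholders; the shape to note is lint and test running in parallel, with build gated on both:

```yaml
name: ci
on: [push]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test
  build:
    needs: [lint, test]      # only runs after both parallel jobs pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build
```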
Common Mistakes with CI/CD Pipeline Design
- Not parallelizing independent test suites. Sequential execution wastes minutes per build.
- Running all tests on every PR. Use affected/changed file detection for large monorepos.
- Manual deployment steps in the pipeline. That defeats the whole point of automation.
- Not caching dependencies between builds. A fresh npm install adds 2-5 minutes every time.
- Sharing mutable state between pipeline stages, which causes flaky tests from leftover state
Tools for CI/CD Pipeline Design
- GitHub Actions (Managed): GitHub-native, marketplace actions, matrix builds — Scale: Small-Enterprise
- GitLab CI (Open Source): Integrated DevOps platform, self-hosted option — Scale: Medium-Enterprise
- Jenkins (Open Source): Maximum flexibility, plugin ecosystem, self-hosted — Scale: Medium-Enterprise
- CircleCI (Managed): Docker-first, fast builds, orbs ecosystem — Scale: Small-Large
Related to CI/CD Pipeline Design
Deployment Strategies, GitOps & Infrastructure as Code, Artifact Management & Container Registry, Container Runtime & Docker
Circuit Breaker & Resilience Patterns — Observability & Reliability
Difficulty: Advanced
Key Points for Circuit Breaker & Resilience Patterns
- Circuit breakers stop cascading failures by fast-failing requests to dependencies that are already down
- Three states (Closed, Open, Half-Open) with transitions driven by failure rate thresholds
- Retry with exponential backoff plus jitter prevents hammering a recovering service with synchronized retries
- Bulkheads isolate failures by capping concurrent requests per dependency so one bad actor can't eat all the threads
- Timeout propagation passes deadlines across service boundaries. Without it, downstream services waste work on requests the caller already gave up on
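The three-state machine above can be sketched in a few lines. Thresholds and the fallback convention are illustrative; libraries like Resilience4j add sliding windows, metrics, and half-open request limits on top of this core:

```python
import time

class CircuitBreaker:
    """Closed -> Open after max_failures; Open -> Half-Open after reset_timeout;
    Half-Open closes on one success or re-opens on one failure."""

    def __init__(self, max_failures=5, reset_timeout=30.0, now=time.monotonic):
        self.max_failures, self.reset_timeout, self.now = max_failures, reset_timeout, now
        self.state, self.failures, self.opened_at = "closed", 0, 0.0

    def call(self, fn, fallback=None):
        if self.state == "open":
            if self.now() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"         # let one probe through
            elif fallback is not None:
                return fallback()                # fast-fail with degraded response
            else:
                raise RuntimeError("circuit open")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.max_failures:
                self.state, self.opened_at = "open", self.now()
            if fallback is not None:
                return fallback()
            raise
        self.failures = 0
        if self.state == "half-open":
            self.state = "closed"                # probe succeeded: recover
        return result
```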
Common Mistakes with Circuit Breaker & Resilience Patterns
- Thresholds too tight. In a 10-request window, a single failure is a 10% error rate, so an aggressive threshold trips the circuit on normal noise
- No fallbacks. An open circuit that throws a raw error at users is worse than returning stale data or a degraded response
- Retrying non-idempotent operations. Retrying a payment charge without idempotency keys means double charges. Expect a 3am page
- One timeout for every dependency. A database query and an external API have completely different baseline latencies. Treat them differently
- Not propagating deadlines. Service A sets a 5s timeout, but service B calls C with a fresh 5s instead of the remaining budget, so C keeps doing work on requests A has already abandoned
Tools for Circuit Breaker & Resilience Patterns
- Resilience4j (Open Source): Java/Kotlin, lightweight, functional API, Spring Boot integration — Scale: Small-Enterprise
- Hystrix (Netflix) (Open Source): Pioneered the pattern, now deprecated. Migrate to Resilience4j — Scale: Legacy
- Envoy Circuit Breaking (Open Source): Infrastructure-level, language-agnostic, mesh-native — Scale: Medium-Enterprise
- Polly (.NET) (Open Source): .NET ecosystem, fluent API, wide range of policy types — Scale: Small-Enterprise
Related to Circuit Breaker & Resilience Patterns
Service Mesh, Rate Limiting & Throttling, Auto-Scaling Patterns, Alerting & On-Call
Compliance & Audit Logging — Security & Governance
Difficulty: Advanced
Key Points for Compliance & Audit Logging
- Audit logs answer the only question that matters after an incident: who did what, when, and from where
- Immutable, append-only storage is non-negotiable. If someone can delete the logs, the logs are worthless
- SOC 2, GDPR, HIPAA, PCI-DSS all require specific logging and retention policies, and auditors will check
- Separation of duties matters. Engineers who deploy code should never be able to touch audit logs
- Automated compliance checks in CI/CD catch policy violations before they hit production
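The append-only requirement is often made tamper-evident with hash chaining: each record commits to the previous one, so any edit or deletion breaks verification from that point on. A sketch of the idea, not a compliance product:

```python
import hashlib
import json

def append(log: list[dict], actor: str, action: str, ts: str) -> None:
    prev = log[-1]["hash"] if log else "0" * 64
    record = {"actor": actor, "action": action, "ts": ts, "prev": prev}
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)

def verify(log: list[dict]) -> bool:
    prev = "0" * 64
    for r in log:
        body = {k: r[k] for k in ("actor", "action", "ts", "prev")}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if r["prev"] != prev or r["hash"] != expected:
            return False                 # chain broken: record altered or missing
        prev = r["hash"]
    return True
```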
Common Mistakes with Compliance & Audit Logging
- Storing audit logs in the same system they audit. A compromised system can wipe its own trail
- Ignoring failed authentication attempts. These are often the earliest signal of an attack
- Keeping logs for too short a period. Compliance frameworks demand 1-7 years depending on the standard
- No alerting on suspicious audit events. Logs nobody reads are just expensive storage
- Stuffing PII into audit records. The audit logs become their own compliance liability
Tools for Compliance & Audit Logging
- AWS CloudTrail (Managed): AWS API audit logging, S3 integration — Scale: Small-Enterprise
- Falco (Open Source): Runtime security, K8s audit, eBPF-based — Scale: Medium-Enterprise
- Splunk (Commercial): Enterprise SIEM, compliance reporting, SPL queries — Scale: Large-Enterprise
- Open Policy Agent (Open Source): Policy enforcement, admission webhooks, audit — Scale: Medium-Enterprise
Related to Compliance & Audit Logging
Secrets Management, Zero Trust & Network Security, Distributed Logging, Object Storage & Data Lake
Container Runtime & Docker — Compute & Orchestration
Difficulty: Intermediate
Key Points for Container Runtime & Docker
- Containers isolate processes using Linux namespaces and cgroups. They are not VMs.
- OCI defines the image format and runtime spec. Docker is just one implementation.
- containerd and CRI-O are the real runtimes in Kubernetes. Docker (dockershim) was removed in K8s 1.24.
- Image layers use copy-on-write. Shared base layers save disk space and speed up pulls.
- Multi-stage builds keep final images small by throwing away build-time dependencies.
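The multi-stage point, sketched as a Dockerfile. The module path and image choices are placeholders; the shape to note is that the Go toolchain lives only in the build stage and the final image carries just the static binary, running as non-root:

```dockerfile
# Build stage: full toolchain, discarded after the build.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app

# Final stage: no compiler, no shell, no package manager.
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
```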
Common Mistakes with Container Runtime & Docker
- Running containers as root. A compromised container gets host-level access.
- Using bloated base images (ubuntu:latest is 77MB) when distroless or alpine will do at 5MB.
- Storing secrets in image layers. They persist in layer history even after deletion in a later layer.
- Not pinning base image versions. FROM python:3 pulls different images over time and will cause breakage.
- Ignoring .dockerignore. The build context drags in junk files and builds slow down for no reason.
Tools for Container Runtime & Docker
- containerd (Open Source): Kubernetes default runtime, CNCF graduated — Scale: Medium-Enterprise
- Docker Engine (Open Source): Developer experience, Docker Compose, build tooling — Scale: Small-Large
- CRI-O (Open Source): Minimal K8s-focused runtime, OpenShift default — Scale: Medium-Enterprise
- Podman (Open Source): Rootless containers, daemonless, Docker CLI-compatible — Scale: Small-Medium
Related to Container Runtime & Docker
Kubernetes Architecture, CI/CD Pipeline Design, Artifact Management & Container Registry, Secrets Management
Database Sharding — Data & Storage
Difficulty: Advanced
Key Points for Database Sharding
- Horizontally partitions data across multiple database instances using a shard key
- Shard key selection is the single most important decision. Get it wrong and the result is hotspots and cross-shard queries everywhere
- Range-based, hash-based, and directory-based sharding each come with real trade-offs
- Resharding (adding or removing shards) is operationally painful. Plan capacity early
- Cross-shard transactions need two-phase commit or saga patterns. Avoid them whenever possible
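Hash-based routing from the points above, sketched. A stable hash matters because Python's builtin `hash` is seeded per process; MD5 here is for distribution, not security:

```python
import hashlib

# Route a shard key to one of N shards via a stable hash.
def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Note that changing `num_shards` remaps almost every key under plain modulo, which is exactly why resharding is painful and why consistent hashing or directory-based schemes exist.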
Common Mistakes with Database Sharding
- Choosing a shard key with low cardinality, so all data ends up on one shard
- Not planning for resharding. Data growth will make the initial shard count insufficient
- Designing queries that need cross-shard joins. This defeats the whole point of sharding
- Using auto-increment IDs as shard keys. This creates write hotspots on the latest shard
- Skipping shard failure testing. Losing one shard should not take down the entire system
Tools for Database Sharding
- Vitess (Open Source): MySQL sharding, used by YouTube/Slack — Scale: Large-Enterprise
- CockroachDB (Open Source): Auto-sharding, distributed SQL, strong consistency — Scale: Medium-Enterprise
- Citus (PostgreSQL) (Open Source): PostgreSQL extension, transparent sharding — Scale: Medium-Enterprise
- MongoDB (Open Source): Native sharding with config servers and mongos — Scale: Medium-Enterprise
Related to Database Sharding
Replication & Consistency, Caching Strategies, Load Balancer
Deployment Strategies — CI/CD & Deployment
Difficulty: Intermediate
Key Points for Deployment Strategies
- Blue-green deploys switch traffic between two identical environments for instant rollback
- Canary releases gradually shift traffic (1% to 5% to 25% to 100%) to catch issues early
- Rolling updates replace instances one at a time. This is the Kubernetes default.
- Feature flags decouple deployment from release. Deploy dark, then enable for specific users.
- Database migrations must be backward-compatible because old code runs alongside new code during rollout
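The "canary without automated analysis" mistake below is avoidable with even a crude automated verdict. Thresholds and the minimum-traffic cutoff here are illustrative; tools like Argo Rollouts and Flagger do a statistical version of this:

```python
# Promote only if the canary's error rate isn't meaningfully worse than
# the baseline's; refuse to judge on too little traffic.
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 2.0, min_requests: int = 100) -> str:
    if canary_total < min_requests:
        return "wait"                            # not enough traffic to judge
    base_rate = baseline_errors / max(baseline_total, 1)
    canary_rate = canary_errors / canary_total
    return "rollback" if canary_rate > base_rate * max_ratio + 0.001 else "promote"
```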
Common Mistakes with Deployment Strategies
- Not testing rollback procedures. Rollback always fails when it matters most.
- Running incompatible database migrations that break old code during a rolling deploy
- Canary without automated analysis. Humans cannot watch dashboards 24/7.
- Blue-green without enough capacity. Running both environments side by side requires 2x infrastructure during the deploy.
- Ignoring deployment velocity. Infrequent large deploys are far riskier than frequent small ones.
Tools for Deployment Strategies
- Argo Rollouts (Open Source): K8s-native canary/blue-green with analysis — Scale: Medium-Enterprise
- Flagger (Open Source): Service mesh integration, automated canary — Scale: Medium-Enterprise
- Spinnaker (Open Source): Multi-cloud deployment pipelines — Scale: Large-Enterprise
- AWS CodeDeploy (Managed): EC2/ECS/Lambda deployments, traffic shifting — Scale: Small-Enterprise
Related to Deployment Strategies
CI/CD Pipeline Design, Load Balancer, Metrics & Monitoring, Auto-Scaling Patterns
Distributed Logging — Observability & Reliability
Difficulty: Intermediate
Key Points for Distributed Logging
- Centralized logging pulls logs from all services into one searchable system
- Structured logging (JSON) makes querying and filtering possible. Unstructured text logs become useless past a handful of services
- ELK/EFK stack (Elasticsearch, Fluentd/Logstash, Kibana) is the classic open-source approach
- Log levels (DEBUG, INFO, WARN, ERROR) should be adjustable at runtime without a redeploy
- Correlation IDs (trace IDs) connect logs across services for a single request, and they are non-negotiable for debugging
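Structured logging with a correlation ID, sketched. Field names are illustrative; the point is one machine-parseable JSON object per line with a trace_id on every one:

```python
import json
import sys

def log(level: str, msg: str, trace_id: str, **fields) -> str:
    """Emit one JSON log line; returns it for convenience."""
    line = json.dumps({"level": level, "msg": msg, "trace_id": trace_id, **fields},
                      sort_keys=True)
    print(line, file=sys.stdout)
    return line
```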
Common Mistakes with Distributed Logging
- Logging sensitive data (passwords, tokens, PII), which violates compliance and creates security risks
- Unstructured log messages. Grep-based debugging falls apart past 10 services
- Not setting log retention policies, so storage costs grow forever
- Logging too much at INFO level, overwhelming the logging pipeline with noise
- Not buffering logs. Direct writes to Elasticsearch from every pod cause write amplification
Tools for Distributed Logging
- Grafana Loki (Open Source): Log aggregation without full-text indexing, cost-efficient — Scale: Medium-Enterprise
- Elasticsearch + Kibana (Open Source): Full-text search, complex queries, mature ecosystem — Scale: Medium-Enterprise
- Datadog Logs (Commercial): Unified with metrics/traces, live tail, patterns — Scale: Small-Enterprise
- Fluentd/Fluent Bit (Open Source): Log collection and routing, CNCF graduated — Scale: Medium-Enterprise
Related to Distributed Logging
Metrics & Monitoring, Distributed Tracing, Compliance & Audit Logging, Object Storage & Data Lake
Distributed Tracing — Observability & Reliability
Difficulty: Advanced
Key Points for Distributed Tracing
- Tracks a single request across multiple services, showing the complete call chain
- Traces are made of spans. Each span is one unit of work with timing and metadata
- OpenTelemetry is the instrumentation standard. Vendor-neutral, covers metrics, logs, and traces
- Head-based vs tail-based sampling. Tail-based is better at catching errors and slow requests
- Trace context propagation via W3C Trace Context headers (traceparent, tracestate) is the standard
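The traceparent header format from the W3C spec (version-traceid-spanid-flags, all lowercase hex) can be generated and parsed in a few lines. Error handling is minimal and only version 00 is accepted:

```python
import re
import secrets

# 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def parse_traceparent(header: str):
    m = TRACEPARENT_RE.match(header)
    if not m:
        return None
    trace_id, span_id, flags = m.groups()
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": int(flags, 16) & 1 == 1}   # sampled is the low bit
```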
Common Mistakes with Distributed Tracing
- Tracing 100% of requests in production. Storage and processing costs become crushing at scale
- Not propagating trace context through message queues. Async calls silently break the trace chain
- Only instrumenting HTTP calls. Database queries, cache lookups, and queue operations need spans too
- Ignoring sampling configuration. Default head-based sampling misses rare but important errors
- Not correlating traces with logs and metrics. All three pillars should be linked by trace ID
Tools for Distributed Tracing
- Jaeger (Open Source): Distributed tracing, CNCF graduated, mature UI — Scale: Medium-Enterprise
- Grafana Tempo (Open Source): Object storage backend, cost-efficient, TraceQL — Scale: Medium-Enterprise
- OpenTelemetry (Open Source): Vendor-neutral instrumentation SDK and collector — Scale: Small-Enterprise
- Datadog APM (Commercial): Unified observability, service maps, error tracking — Scale: Small-Enterprise
Related to Distributed Tracing
Metrics & Monitoring, Distributed Logging, Service Mesh, Alerting & On-Call
DNS & Service Discovery — Networking & Traffic
Difficulty: Intermediate
Key Points for DNS & Service Discovery
- DNS maps human-readable names to IP addresses. It is the internet's phone book.
- Service discovery lets microservices find each other at runtime instead of relying on hardcoded addresses
- Client-side vs server-side discovery: different trade-offs in complexity and load distribution
- TTL management matters more than people think. Too short and it hammers the DNS servers, too long and failover stalls.
- Health-aware DNS (Route 53, Consul) pulls unhealthy endpoints out of responses automatically
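The "don't cache forever" mistake below is a one-function fix: honor the TTL the resolver returns. `resolver` stands in for a real lookup; everything here is illustrative:

```python
import time

_cache: dict[str, tuple[float, list[str]]] = {}   # name -> (expires_at, ips)

def resolve(name: str, resolver, now=time.monotonic) -> list[str]:
    hit = _cache.get(name)
    if hit and hit[0] > now():
        return hit[1]                     # still within TTL
    ips, ttl = resolver(name)             # resolver returns ([ip, ...], ttl_seconds)
    _cache[name] = (now() + ttl, ips)
    return ips
```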
Common Mistakes with DNS & Service Discovery
- Caching DNS results forever in application code. Ignoring the TTL means routing traffic to dead hosts.
- Not retrying with fresh DNS resolution after a connection failure. The cached IP might be the problem.
- Using DNS for load balancing without realizing that most clients cache the first resolved IP and stick with it
- Running the service registry as a single instance, which turns it into a single point of failure for the entire fleet
- Forgetting to monitor DNS resolution latency. Slow DNS adds hidden latency to every single request.
Tools for DNS & Service Discovery
- Consul (Open Source): Service mesh, health checks, KV store — Scale: Medium-Enterprise
- CoreDNS (Open Source): Kubernetes DNS, plugin-based — Scale: Medium-Enterprise
- AWS Route 53 (Managed): Global DNS, health checks, failover routing — Scale: Small-Enterprise
- etcd (Open Source): Kubernetes backing store, strong consistency — Scale: Medium-Large
Related to DNS & Service Discovery
API Gateway, Load Balancer, Service Mesh, Service Discovery & Registration
DPDK (Data Plane Development Kit) — Networking & Traffic
Difficulty: Expert
Key Points for DPDK (Data Plane Development Kit)
- Complete kernel bypass. The NIC talks directly to the application through userspace memory. No syscalls, no interrupts, no socket buffers.
- 15-20M+ packets/sec per core. Roughly 10-20x what the kernel networking stack can do.
- Requires dedicated CPU cores that poll the NIC in a tight loop. Those cores run at 100% even when idle.
- The NIC is taken away from the kernel entirely. Normal sockets, ping, tcpdump stop working on that interface.
- Used by telecom NFV, financial exchanges, high-throughput packet brokers, and observability collector gateways
Common Mistakes with DPDK (Data Plane Development Kit)
- Underestimating the operational cost. DPDK takes over the NIC. tcpdump, ping, iptables, and every other kernel networking tool stop working on that interface. A separate management NIC is required.
- Not reserving hugepages at boot time. DPDK needs 1GB or 2MB hugepages. Trying to allocate them after the system has been running means memory fragmentation results in fewer than expected.
- Running DPDK on all NICs. Only bind DPDK to the data-plane NIC. Keep at least one NIC on the kernel stack for SSH, monitoring, and management traffic.
- Forgetting that dedicated cores mean those cores are unavailable to everything else. On a 16-core box, burning 4 cores for DPDK polling leaves 12 for the application. Plan the CPU budget.
- Assuming DPDK is always better than XDP. For workloads under 10M pps, XDP delivers 80% of the performance with 20% of the complexity. DPDK only makes sense at the fan-in points where traffic from hundreds or thousands of sources converges.
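The hugepage and NIC-binding points translate to a host-configuration fragment roughly like the following. The PCI address is a placeholder (check `dpdk-devbind.py --status` for the real one), and the hugepage count depends on the workload:

```shell
# Reserve 1GB hugepages at boot via the kernel command line
# (in /etc/default/grub, then update-grub and reboot):
#   GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=8"

# Bind only the data-plane NIC to DPDK; the management NIC stays on the
# kernel stack so SSH and monitoring keep working.
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
dpdk-devbind.py --status     # confirm which NICs the kernel still owns
```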
Tools for DPDK (Data Plane Development Kit)
- DPDK (Open Source): Maximum packet throughput, full control over packet processing — Scale: Large-Enterprise
- fd.io VPP (Open Source): High-performance virtual switch/router built on DPDK — Scale: Large-Enterprise
- XDP + eBPF (Open Source): Lighter weight, no dedicated cores, runs alongside normal kernel networking — Scale: Medium-Enterprise
- Netmap (Open Source): Simpler kernel bypass alternative, less ecosystem than DPDK — Scale: Medium-Large
Related to DPDK (Data Plane Development Kit)
XDP (eXpress Data Path), Load Balancer, Service Mesh
GitOps & Infrastructure as Code — CI/CD & Deployment
Difficulty: Advanced
Key Points for GitOps & Infrastructure as Code
- Git as the single source of truth for both application and infrastructure config
- Declarative infrastructure: define the desired state, not how to get there
- Pull-based GitOps (ArgoCD, Flux) vs push-based IaC (Terraform apply in CI)
- Drift detection through continuous reconciliation, keeping actual state in line with desired state
- Infrastructure changes go through the same PR review process as application code
Common Mistakes with GitOps & Infrastructure as Code
- Running kubectl apply by hand in production, which throws away the audit trail and review process entirely
- Storing Terraform state locally without a remote backend and locking. Two concurrent applies will corrupt the state.
- Copy-pasting infrastructure instead of writing reusable modules. This falls apart the moment anything needs updating.
- Mixing application deployment with infrastructure provisioning in the same pipeline
- Skipping infrastructure tests. Terraform plan is the unit test. Apply to staging is the integration test. Do both.
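The remote-state-with-locking mistake above maps to a small Terraform backend block. Bucket and table names are placeholders; the DynamoDB table is what prevents two concurrent applies from corrupting state:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"   # enables state locking
    encrypt        = true
  }
}
```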
Tools for GitOps & Infrastructure as Code
- ArgoCD (Open Source): K8s GitOps, multi-cluster, UI dashboard — Scale: Medium-Enterprise
- Terraform (Open Source): Multi-cloud IaC, stateful resource management — Scale: Small-Enterprise
- Pulumi (Open Source): IaC in real programming languages (TS, Python, Go) — Scale: Small-Enterprise
- Crossplane (Open Source): K8s-native cloud resource provisioning via CRDs — Scale: Medium-Enterprise
Related to GitOps & Infrastructure as Code
CI/CD Pipeline Design, Kubernetes Architecture, Secrets Management, Compliance & Audit Logging
Hot, Warm & Cold Data Tiering — Data & Storage
Difficulty: Advanced
Key Points for Hot, Warm & Cold Data Tiering
- Not all data deserves the same storage. Hot data sits in memory or NVMe SSDs, warm data on cheaper SSDs, cold data on object storage. The cost difference between tiers is 100x
- The boundary between tiers is defined by access frequency and latency requirements, not by data age alone. A 3-year-old record queried daily is hot, not cold
- Promotion and demotion policies determine when data moves between tiers. Time-based is simplest, access-count-based is most accurate, most teams use a hybrid
- Elasticsearch, ClickHouse, and Kafka all have built-in tiered storage. A custom query routing layer is not always necessary
- The query pattern matters as much as the storage tier. A cold-tier query that scans terabytes needs a different approach (async, pre-aggregated) than a hot-tier point lookup
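The frequency-plus-latency rule (rather than age alone) can be sketched as a tier assignment. The thresholds are illustrative; the whole point of the entry is to measure real access patterns before picking them:

```python
def assign_tier(reads_per_day: float, p99_budget_ms: float) -> str:
    """Assign a storage tier from access frequency and latency budget."""
    if p99_budget_ms <= 10 or reads_per_day >= 1000:
        return "hot"       # memory / NVMe
    if reads_per_day >= 1:
        return "warm"      # cheaper SSD; avoids the 1ms-to-3s cliff
    return "cold"          # object storage; async or pre-aggregated queries
```

Note how a record read once a day with a tight latency budget lands in hot, no matter how old it is.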
Common Mistakes with Hot, Warm & Cold Data Tiering
- Tiering by age alone. A 3-year-old record that gets queried daily is hot, not cold. Measure access frequency before drawing tier boundaries
- No warm tier. Going straight from in-memory cache to S3 creates a latency cliff where responses jump from 1ms to 3 seconds with nothing in between
- Forgetting that cold-tier queries still need to be usable. Users do not care where the data lives. If the query takes 8 seconds, that is the experience
- Not measuring access patterns before choosing tier boundaries. Most teams guess wrong about what is hot. Instrument first, tier second
- Over-engineering tiering for small datasets. If everything fits on a single SSD, skip the complexity and just use one SSD
Tools for Hot, Warm & Cold Data Tiering
- Elasticsearch ILM (Open Source): Log and event data lifecycle with automatic rollover, shrink, freeze, and delete — Scale: Medium-Enterprise
- ClickHouse Tiered Storage (Open Source): Analytics data with volume-based policies, S3-backed MergeTree for cold partitions — Scale: Medium-Enterprise
- Kafka Tiered Storage (KIP-405) (Open Source): Event log retention beyond broker disk, transparent S3 offload for old segments — Scale: Large-Enterprise
- AWS S3 Intelligent-Tiering (Managed): Object storage with automatic access-pattern-based tiering, no retrieval fees — Scale: Small-Enterprise
- Snowflake (Commercial): Transparent hot/warm/cold with auto-scaling compute per tier, zero admin — Scale: Medium-Enterprise
Related to Hot, Warm & Cold Data Tiering
Caching Strategies, Object Storage & Data Lake, Database Sharding
Istio — Networking & Traffic
Difficulty: Advanced
Key Points for Istio
- Istiod is the single-binary control plane that merges Pilot, Citadel, and Galley. It compiles routing rules into Envoy xDS configuration and pushes it to every sidecar over gRPC streaming.
- VirtualService and DestinationRule are the two most-used CRDs. VirtualService controls where traffic goes. DestinationRule controls what happens when it gets there.
- mTLS is automatic through SPIFFE identity. Istiod's built-in CA (the former Citadel component) issues short-lived certificates (24h default TTL) with no application code changes required.
- Ambient mode replaces per-pod sidecars with per-node ztunnel (L4) and optional waypoint proxies (L7), cutting resource overhead by 60-80%.
- At 1,000 sidecars, budget roughly 100GB of additional cluster memory and 1GB for Istiod. Know these numbers before committing.
Common Mistakes with Istio
- Deploying Istio mesh-wide on day one instead of adopting namespace by namespace. This turns every misconfiguration into a cluster-wide incident.
- Writing VirtualService rules without understanding Envoy route matching precedence. Longest prefix match is not the same as first match. Read the Envoy docs, not just the Istio docs.
- Ignoring Istiod resource limits. A single Istiod instance managing 3,000+ sidecars without tuned memory limits will OOMKill during large config pushes.
- Not running istioctl analyze in CI. Invalid CRDs silently break routing, and the problem goes unnoticed until traffic stops flowing.
- Enabling Wasm plugins in production without load-testing them first. A slow plugin adds latency to every single request through that sidecar.
- Leaving PeerAuthentication in PERMISSIVE mode permanently. It's meant for migration, not as a final state. The result is half-mTLS with a false sense of security.
- Skipping revision-based canary upgrades for Istiod itself. In-place upgrades risk dropping xDS connections to all sidecars simultaneously.
Tools for Istio
- Istio (Sidecar Mode) (Open Source): Full L7 traffic management, policy enforcement, multi-cluster — Scale: Large-Enterprise
- Istio (Ambient Mode) (Open Source): Lower resource overhead, no sidecar injection, L4 by default — Scale: Medium-Enterprise
- Envoy Gateway (Open Source): Kubernetes Gateway API ingress without full mesh overhead — Scale: Medium-Large
- Gloo Mesh (Commercial): Multi-cluster Istio management with enterprise support — Scale: Large-Enterprise
Related to Istio
Service Mesh, Load Balancer, API Gateway, Kubernetes Architecture, Distributed Tracing, Zero Trust & Network Security
Kubernetes Architecture — Compute & Orchestration
Difficulty: Advanced
Key Points for Kubernetes Architecture
- Container orchestration platform born from Google's Borg. Automates deployment, scaling, self-healing, and rollback for containerized workloads.
- The control plane (API server, etcd, scheduler, controller manager, cloud controller manager) tracks desired state. Worker nodes run actual workloads through kubelet and a CRI-compatible container runtime.
- Operators extend Kubernetes by encoding operational knowledge into custom controllers paired with CRDs. They turn complex stateful apps into first-class platform citizens.
- Node fingerprinting (via kubelet and NFD) discovers hardware capabilities like GPUs, CPU instruction sets, and NVMe disks so the scheduler places workloads on the right machines.
- Gateway API has replaced Ingress as the standard for traffic routing. It provides typed routes (HTTP, gRPC, TCP, TLS) with role-oriented resource ownership.
- Resource requests and limits, HPA, VPA, and Karpenter enable right-sizing workloads and auto-scaling nodes in response to real demand.
Common Mistakes with Kubernetes Architecture
- Skipping resource requests and limits. Pods will get OOMKilled or starve neighbors. Every production pod needs both.
- Using the latest tag in production. Deterministic deploys are lost and rollbacks become a guessing game. Pin to digests or immutable tags.
- Running workloads without PodDisruptionBudgets. Cluster upgrades and node drains will nuke all replicas at once.
- Ignoring namespace resource quotas. One team's runaway deployment eats the whole cluster budget.
- Not configuring liveness and readiness probes. Without them, Kubernetes routes traffic to broken pods and never restarts stuck containers.
- Writing operators without rate limiting on the reconcile loop. A tight loop hammering the API server can destabilize the entire control plane.
- Treating Ingress as the long-term routing solution. Gateway API is the standard now. New clusters should start with it.
- Skipping node fingerprinting for GPU or specialized workloads. Without proper labels, the scheduler has no way to match workloads to hardware capabilities.
Tools for Kubernetes Architecture
- EKS (Managed): AWS-native, Fargate serverless pods, deep IAM integration — Scale: Medium-Enterprise
- GKE (Managed): Autopilot mode, fastest upstream K8s releases, best managed experience — Scale: Medium-Enterprise
- AKS (Managed): Azure-native, KEDA built-in, good Windows container support — Scale: Medium-Enterprise
- k3s (Open Source): Lightweight single binary, edge and IoT, ARM support — Scale: Small-Medium
- Talos Linux (Open Source): Immutable OS purpose-built for K8s, API-managed nodes, no SSH — Scale: Medium-Enterprise
- OpenShift (Commercial): Enterprise compliance, integrated CI/CD, developer portal — Scale: Large-Enterprise
Related to Kubernetes Architecture
Container Runtime & Docker, Service Mesh, Auto-Scaling Patterns, Service Discovery & Registration, GitOps & Infrastructure as Code, Deployment Strategies, Secrets Management
Linkerd — Networking & Traffic
Difficulty: Advanced
Key Points for Linkerd
- Built on linkerd2-proxy, a purpose-built Rust proxy that uses roughly 30MB per sidecar compared to Envoy's 100MB. Smaller memory footprint, fewer moving parts.
- mTLS is automatic and always-on from the moment the proxy is injected. No configuration, no PeerAuthentication CRD, no PERMISSIVE mode to forget about.
- Per-route golden metrics (success rate, latency, throughput) work out of the box without touching application code or Prometheus configuration.
- Follows Service Mesh Interface (SMI) specs for traffic splitting. TrafficSplit manifests stay portable across mesh implementations.
- Median install-to-production time is 2 weeks. That's not marketing. Buoyant tracks this across their customer base.
Common Mistakes with Linkerd
- Expecting Istio-level traffic management CRDs. Linkerd intentionally keeps its API surface small. For VirtualService-level routing with header matching and fault injection, Istio is the right tool, not Linkerd.
- Skipping linkerd check before troubleshooting. Run it first. 90% of issues show up in the built-in diagnostics, and it takes 10 seconds.
- Not configuring opaque ports for non-HTTP TCP traffic. The proxy tries to parse everything as HTTP by default, which breaks protocols like MySQL, Redis, and NATS.
- Ignoring the proxy injector webhook priority. If other admission webhooks modify pods after Linkerd's injection, the sidecar config can get corrupted silently.
- Assuming multi-cluster just works out of the box. It still requires setting up gateway pods, linking clusters, and mirroring services explicitly.
- Running the viz extension in production without retention limits. Prometheus scraping per-route metrics across thousands of pods will eat the cluster's memory.
Tools for Linkerd
- Linkerd (Open Source): Lightweight mesh, zero-config mTLS, fast adoption — Scale: Medium-Large
- Linkerd Buoyant Enterprise (Commercial): Enterprise support, compliance features, lifecycle automation — Scale: Large-Enterprise
- Istio (Open Source): Full-featured mesh when you need advanced traffic management — Scale: Large-Enterprise
- Cilium Service Mesh (Open Source): eBPF-based mesh, no sidecar, kernel-level performance — Scale: Medium-Enterprise
Related to Linkerd
Service Mesh, Load Balancer, Kubernetes Architecture, Distributed Tracing, DNS & Service Discovery, Service Discovery & Registration
Load Balancer — Networking & Traffic
Difficulty: Intermediate
Key Points for Load Balancer
- Distributes incoming traffic across multiple backend servers to prevent overload
- L4 (transport) balances TCP/UDP connections without inspecting payloads; L7 (application) parses HTTP and can route per request, at higher cost
- Provides horizontal scaling, fault tolerance, and zero-downtime deployments
- Health checks automatically pull unhealthy backends out of the pool
- Session affinity (sticky sessions) vs stateless backends is an architectural decision that needs to be made early
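The mixed-hardware mistake below ("equal load regardless of capacity") is usually fixed with weighted balancing. Here is a minimal sketch of smooth weighted round-robin, the variant NGINX popularized; backend names and weights are hypothetical:

```python
class WeightedRoundRobin:
    """Smooth weighted round-robin: each pick, add every backend's static
    weight to its running score, choose the highest score, then subtract
    the total weight from the winner. Spreads picks evenly over time."""
    def __init__(self, weights):
        self.weights = dict(weights)           # backend -> static weight
        self.current = {b: 0 for b in weights}

    def pick(self):
        total = sum(self.weights.values())
        for backend, weight in self.weights.items():
            self.current[backend] += weight
        best = max(self.current, key=self.current.get)
        self.current[best] -= total
        return best

lb = WeightedRoundRobin({"big-box": 3, "small-box": 1})
picks = [lb.pick() for _ in range(8)]
print(picks)   # big-box gets 3 of every 4 picks, matching its capacity
```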
Common Mistakes with Load Balancer
- Using sticky sessions without thinking through the failure mode. When that backend dies, session data goes with it.
- Getting health check intervals wrong. Too fast and backends get overwhelmed, too slow and failover takes forever.
- Skipping connection draining during deployments, which drops in-flight requests on the floor
- Running round robin across servers with different hardware specs. They all get equal load regardless of capacity.
- Forgetting about keep-alive connection imbalance. Long-lived connections skew distribution in surprising ways.
Tools for Load Balancer
- AWS ALB/NLB (Managed): Cloud-native, auto-scaling integration — Scale: Small-Enterprise
- HAProxy (Open Source): High-performance L4/L7, battle-tested — Scale: Medium-Enterprise
- Envoy (Open Source): Service mesh, advanced observability — Scale: Large-Enterprise
- NGINX (Open Source): Web server + reverse proxy + LB — Scale: Small-Enterprise
Related to Load Balancer
API Gateway, WebSocket Gateway, CDN & Edge Computing, Auto-Scaling Patterns, DNS & Service Discovery
Message Queues & Event Streaming — Data & Storage
Difficulty: Intermediate
Key Points for Message Queues & Event Streaming
- Decouples producers from consumers, allowing asynchronous processing and better fault tolerance
- Message queues (RabbitMQ, SQS) deliver each message to one consumer. Event streams (Kafka) let multiple consumers read the same data independently
- At-least-once vs exactly-once vs at-most-once: the delivery guarantee choice shapes the entire application design
- Consumer groups provide parallel processing while keeping order within each partition
- Dead letter queues catch failed messages for debugging and reprocessing later
Common Mistakes with Message Queues & Event Streaming
- Not making consumers idempotent. At-least-once delivery means messages can and will be processed more than once
- Letting queues grow without bounds. Producers outpace consumers, the queue fills the disk, and the broker dies
- Reaching for a message queue when a plain function call would do the job. Don't add unnecessary complexity
- Ignoring consumer lag. If consumers fall behind, there's a scaling or performance problem, and it's better to know about it before the users do
- Assuming ordering across partitions. Kafka only guarantees order within a single partition, not across them
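The idempotent-consumer fix from the first mistake above can be sketched as a wrapper that deduplicates on a producer-assigned message ID. The in-memory set is an assumption for illustration; production would use Redis or a processed-messages table committed with the work:

```python
def make_idempotent(handler):
    """Wrap a message handler so redelivered messages are processed once."""
    seen = set()
    def handle(message):
        msg_id = message["id"]       # producer-assigned unique ID
        if msg_id in seen:
            return "skipped"         # duplicate delivery: safe no-op
        result = handler(message)
        seen.add(msg_id)             # mark done only after success
        return result
    return handle

charges = []

@make_idempotent
def charge(message):
    charges.append(message["amount"])
    return "charged"

charge({"id": "m1", "amount": 42})
charge({"id": "m1", "amount": 42})   # at-least-once redelivery
print(charges)                       # [42] — charged exactly once
```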
Tools for Message Queues & Event Streaming
- Apache Kafka (Open Source): Event streaming, high throughput, replay capability — Scale: Large-Enterprise
- RabbitMQ (Open Source): Traditional messaging, routing patterns, AMQP — Scale: Small-Large
- AWS SQS/SNS (Managed): Serverless, zero ops, fan-out patterns — Scale: Small-Enterprise
- Apache Pulsar (Open Source): Multi-tenancy, tiered storage, geo-replication — Scale: Large-Enterprise
Related to Message Queues & Event Streaming
Replication & Consistency, Caching Strategies, Distributed Logging, CI/CD Pipeline Design
Metrics & Monitoring — Observability & Reliability
Difficulty: Intermediate
Key Points for Metrics & Monitoring
- Of the three observability pillars (metrics, logs, traces), metrics are where alerting starts
- USE method (Utilization, Saturation, Errors) for infrastructure; RED method (Rate, Errors, Duration) for services
- Prometheus pulls from /metrics endpoints; push-based systems (Datadog, StatsD) receive data from agents
- Cardinality explosion will kill the metrics backend. Never use unbounded label values like user IDs or request IDs.
- SLIs, SLOs, and error budgets turn raw numbers into reliability commitments the business can actually understand
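The cardinality warning above is just multiplication: every unique label combination is its own time series, so the series count is the product of per-label cardinalities. The label names and counts here are illustrative:

```python
from math import prod

def series_count(label_cardinalities):
    """Total time series for one metric = product of label cardinalities."""
    return prod(label_cardinalities.values())

# Bounded labels: perfectly manageable.
print(series_count({"method": 7, "status": 5, "region": 3}))   # 105

# Add one unbounded label like user_id and the same metric explodes.
print(series_count({"method": 7, "status": 5, "region": 3,
                    "user_id": 1_000_000}))                    # 105000000
```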
Common Mistakes with Metrics & Monitoring
- Building 50 dashboards before writing a single alert. Dashboards help investigate. Alerts detect problems.
- Alerting on symptoms like high CPU instead of user impact like elevated error rate or high latency
- Skipping SLO definition before building monitoring. Without defining 'healthy,' there's nothing to measure.
- High-cardinality labels causing Prometheus OOM. Every unique label combo creates a new time series.
- Forgetting to monitor the monitoring system. Prometheus needs health checks and alerts on itself.
Tools for Metrics & Monitoring
- Prometheus (Open Source): Kubernetes-native, PromQL, pull-based — Scale: Medium-Enterprise
- Datadog (Commercial): Unified observability, APM, easy setup — Scale: Small-Enterprise
- Grafana + Mimir (Open Source): Long-term storage, multi-tenant Prometheus — Scale: Large-Enterprise
- Victoria Metrics (Open Source): High-performance TSDB, PromQL-compatible — Scale: Medium-Enterprise
Related to Metrics & Monitoring
Distributed Logging, Distributed Tracing, Alerting & On-Call, Auto-Scaling Patterns
Object Storage & Data Lake — Data & Storage
Difficulty: Intermediate
Key Points for Object Storage & Data Lake
- Stores unstructured data (files, images, logs, backups) as objects with metadata in flat namespaces
- Virtually unlimited scalability. S3 stores over 200 trillion objects
- 11 nines (99.999999999%) durability via erasure coding and redundancy across multiple availability zones; cross-region replication is an optional extra
- Data lakes layer structured query engines (Athena, Presto, Spark) over raw object storage
- Storage tiering (hot/warm/cold/archive) keeps costs sane. Lifecycle policies automate transitions
Common Mistakes with Object Storage & Data Lake
- Storing small objects individually. High request overhead per object, so batch into larger files
- Not enabling versioning. Accidental deletes or overwrites become irrecoverable
- Ignoring storage class optimization. Keeping cold data in hot tier wastes 60-80% on storage costs
- Not using multipart upload for large files. A failed single PUT restarts the whole transfer from zero; multipart resumes from the last completed part
- Flat namespace without key prefix strategy. Poor prefix design causes throttling (S3 partitions by prefix)
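The prefix-design point above is often addressed by hashing a short shard prefix onto keys so sequential names (dates, counters) don't concentrate requests on one partition. A minimal sketch; the shard count and key shapes are illustrative, and modern S3 needs this less than it once did:

```python
import hashlib

def sharded_key(logical_key, shards=16):
    """Prepend a deterministic 2-hex-char shard prefix to spread keys
    across partitions. Same logical key always maps to the same shard."""
    digest = hashlib.md5(logical_key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02x}/{logical_key}"

# Timestamp-ordered uploads no longer share one hot prefix:
for key in ("logs/2024-01-01/a.gz", "logs/2024-01-01/b.gz"):
    print(sharded_key(key))
```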
Tools for Object Storage & Data Lake
- AWS S3 (Managed): De facto standard, broadest ecosystem integration — Scale: Small-Enterprise
- MinIO (Open Source): S3-compatible on-premise, Kubernetes-native — Scale: Medium-Enterprise
- GCS (Managed): BigQuery integration, strong consistency — Scale: Small-Enterprise
- Azure Blob Storage (Managed): Azure ecosystem, ADLS Gen2 for analytics — Scale: Small-Enterprise
Related to Object Storage & Data Lake
Database Sharding, Distributed Logging, Compliance & Audit Logging, CI/CD Pipeline Design
Rate Limiting & Throttling — Networking & Traffic
Difficulty: Intermediate
Key Points for Rate Limiting & Throttling
- Caps request rates to protect services from abuse, DDoS, and resource exhaustion
- Token bucket, sliding window, and leaky bucket are the three core algorithms worth knowing
- Distributed rate limiting needs shared state (Redis). Local-only limits apply per instance, which is often not the intended behavior
- Always communicate limits through response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
- Set different limits per tier. Free users, paid users, and internal services should not share the same budget
Common Mistakes with Rate Limiting & Throttling
- Using per-instance rate limiting instead of global. Clients can bypass it by hitting different instances
- Not separating authenticated and unauthenticated rate limits
- Setting limits too tight at launch, then throttling legitimate traffic before real usage data exists
- Skipping the Retry-After header. Clients retry immediately and the result is a thundering herd
- Rate limiting by IP only. Shared IPs behind NAT or corporate proxies punish every user behind them
Tools for Rate Limiting & Throttling
- Redis + Lua (Open Source): Distributed counters, atomic operations — Scale: Medium-Enterprise
- Envoy Rate Limit (Open Source): Service mesh integration, per-route limits — Scale: Large-Enterprise
- Kong Rate Limiting (Open Source): API gateway plugin, Redis-backed — Scale: Medium-Enterprise
- AWS WAF (Managed): Edge rate limiting, IP-based rules — Scale: Small-Enterprise
Related to Rate Limiting & Throttling
API Gateway, Load Balancer, Caching Strategies
Replication & Consistency — Data & Storage
Difficulty: Advanced
Key Points for Replication & Consistency
- Copies data across multiple nodes for fault tolerance and read scalability
- CAP theorem constrains the choices. During a network partition, a system must pick between consistency and availability
- Synchronous replication guarantees consistency but adds real write latency
- Eventual consistency works fine for most read-heavy workloads when the application is designed around it
- Consensus protocols (Raft, Paxos) handle leader election and keep replicated state machines in sync
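One concrete way the consistency trade-off surfaces is Dynamo-style quorum tuning, as in Cassandra's tunable consistency: with N replicas, a write quorum W and read quorum R are guaranteed to overlap on at least one replica whenever R + W > N. A one-line check, for illustration:

```python
def quorum_is_strong(n, w, r):
    """True when read and write quorums must overlap, so every read
    touches at least one replica holding the latest acknowledged write."""
    return r + w > n

print(quorum_is_strong(n=3, w=2, r=2))   # True: majority quorums overlap
print(quorum_is_strong(n=3, w=1, r=1))   # False: a read can miss the write
```

Lowering R or W buys latency and availability at the cost of possibly stale reads, which is exactly the eventual-consistency bargain described above.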
Common Mistakes with Replication & Consistency
- Reading from async replicas right after writing. Classic read-your-own-writes violation
- Treating eventual consistency as 'eventually correct.' Conflicts still need explicit resolution
- Ignoring replication lag. Stale reads cause subtle bugs that are painful to track down
- Running synchronous replication across data centers. The latency will destroy write throughput
- Never testing failover. Promoting a replica to primary should be rehearsed, not figured out during a 3 AM incident
Tools for Replication & Consistency
- PostgreSQL (Open Source): Streaming replication, synchronous commit options — Scale: Small-Enterprise
- CockroachDB (Open Source): Raft-based automatic replication, strong consistency — Scale: Medium-Enterprise
- Cassandra (Open Source): Tunable consistency, multi-DC replication — Scale: Large-Enterprise
- TiDB (Open Source): MySQL-compatible, Raft-based, HTAP — Scale: Large-Enterprise
Related to Replication & Consistency
Database Sharding, Message Queues & Event Streaming, Kubernetes Architecture
Secrets Management — Security & Governance
Difficulty: Intermediate
Key Points for Secrets Management
- Centralized secret storage with encryption at rest and fine-grained access control
- Secrets should never live in code, environment variables, or container images
- Dynamic secrets (generated on-demand with a TTL) beat static credentials for security
- Automatic rotation shrinks the blast radius when credentials get compromised
- Audit logging tracks who accessed which secret and when, which is non-negotiable for compliance
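The dynamic-secrets point above implies clients must refresh credentials before the TTL runs out. A minimal client-side sketch, with a hypothetical `fetch` callback standing in for a real store's API (e.g. a Vault read) and an illustrative refresh margin:

```python
import time

class SecretCache:
    """Cache a dynamic secret and re-fetch before its TTL expires.
    The margin refreshes early so a credential is never used at the
    edge of expiry."""
    def __init__(self, fetch, margin=0.2):
        self.fetch = fetch            # callback returning (secret, ttl_seconds)
        self.margin = margin
        self.secret, self.expires = None, 0.0

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        if self.secret is None or now >= self.expires:
            secret, ttl = self.fetch()
            self.secret = secret
            self.expires = now + ttl * (1 - self.margin)
        return self.secret

calls = []
def fetch():
    calls.append(1)
    return f"db-pass-{len(calls)}", 60     # fresh credential, 60s TTL

cache = SecretCache(fetch)
print(cache.get(now=0), cache.get(now=10))   # db-pass-1 db-pass-1 (cached)
print(cache.get(now=50))                     # db-pass-2 (refreshed at 48s)
```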
Common Mistakes with Secrets Management
- Storing secrets in environment variables. They leak into logs, crash dumps, and child processes.
- Committing secrets to git. Even after deletion, they persist in git history forever.
- Not rotating secrets after an employee leaves or a breach is suspected
- Using the same credentials across environments (dev/staging/prod)
- Not encrypting secrets at rest in the secret store. Defense in depth applies here too.
Tools for Secrets Management
- HashiCorp Vault (Open Source): Dynamic secrets, PKI, transit encryption — Scale: Medium-Enterprise
- AWS Secrets Manager (Managed): AWS-native, automatic RDS rotation — Scale: Small-Enterprise
- External Secrets Operator (Open Source): Sync cloud secrets into K8s Secrets — Scale: Medium-Enterprise
- CyberArk Conjur (Commercial): Enterprise PAM, compliance-focused — Scale: Enterprise
Related to Secrets Management
Zero Trust & Network Security, Compliance & Audit Logging, Container Runtime & Docker, GitOps & Infrastructure as Code
Serverless & FaaS — Compute & Orchestration
Difficulty: Intermediate
Key Points for Serverless & FaaS
- Run code without managing servers. The cloud provider handles provisioning, scaling, and patching.
- Pay-per-invocation pricing means zero cost at zero traffic, but it gets expensive fast at sustained high throughput.
- Cold start latency (100ms-10s) is the biggest trade-off. Provisioned concurrency helps, but costs money.
- Great for event-driven, bursty, short-lived workloads. Poor fit for long-running processes.
- Vendor lock-in is real. Lambda, Cloud Functions, and Azure Functions each have different APIs and limits.
Common Mistakes with Serverless & FaaS
- Using serverless for latency-sensitive synchronous APIs without provisioned concurrency
- Ignoring concurrent execution limits, then getting throttled when traffic spikes
- Building monolithic functions that do too much. Each function should do one thing.
- Forgetting cold start impact on P99 latency. That 1% of requests can be 10x slower than normal.
- Skipping timeout and memory limits. Runaway functions will burn through the budget.
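The P99 cold-start mistake above is easy to quantify with a toy mixture model. The latencies and cold fraction here are illustrative, not measured:

```python
def p99_with_cold_starts(warm_ms, cold_ms, cold_fraction, n=100_000):
    """Model latency as a mix of warm and cold invocations and read off
    the 99th percentile (nearest-rank over n simulated requests)."""
    cold_count = int(n * cold_fraction)
    latencies = [cold_ms] * cold_count + [warm_ms] * (n - cold_count)
    latencies.sort()
    return latencies[int(n * 0.99)]

# 2% cold starts at 3s turn a 50ms service into a 3s P99:
print(p99_with_cold_starts(warm_ms=50, cold_ms=3000, cold_fraction=0.02))  # 3000
```

Median and average look fine in this model; only the tail exposes the cold starts, which is why P99 is the number to watch.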
Tools for Serverless & FaaS
- AWS Lambda (Managed): Broadest event source integration, mature ecosystem — Scale: Small-Enterprise
- Cloudflare Workers (Managed): Edge execution, V8 isolates, sub-ms cold start — Scale: Small-Enterprise
- Google Cloud Functions (Managed): GCP integration, Cloud Run for containers — Scale: Small-Enterprise
- Knative (Open Source): Serverless on Kubernetes, no vendor lock-in — Scale: Medium-Enterprise
Related to Serverless & FaaS
API Gateway, Auto-Scaling Patterns, CI/CD Pipeline Design, Metrics & Monitoring
Service Discovery & Registration — Compute & Orchestration
Difficulty: Intermediate
Key Points for Service Discovery & Registration
- Services need to find each other dynamically when IPs change constantly in containerized environments
- Self-registration means the service handles its own registry updates. Third-party registration offloads that to an external agent.
- Client-side discovery gives callers control over load balancing. Server-side is simpler but adds a hop.
- Health-aware routing pulls unhealthy instances out of the pool automatically
- In Kubernetes, Services and Endpoints are the built-in mechanism, backed by kube-proxy and CoreDNS
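The registration-plus-health-awareness ideas above combine into a TTL-based registry: instances heartbeat periodically, and lookups only return instances seen recently, so a crashed instance falls out of the pool even without explicit deregistration. A minimal sketch with hypothetical service names:

```python
class Registry:
    def __init__(self, ttl=30):
        self.ttl = ttl
        self.instances = {}   # (service, addr) -> last heartbeat time

    def heartbeat(self, service, addr, now):
        self.instances[(service, addr)] = now

    def deregister(self, service, addr):
        """Graceful-shutdown path: remove immediately, don't wait for TTL."""
        self.instances.pop((service, addr), None)

    def lookup(self, service, now):
        """Only instances that heartbeated within the TTL window."""
        return sorted(addr for (svc, addr), seen in self.instances.items()
                      if svc == service and now - seen < self.ttl)

reg = Registry(ttl=30)
reg.heartbeat("orders", "10.0.0.1:8080", now=0)
reg.heartbeat("orders", "10.0.0.2:8080", now=0)
print(reg.lookup("orders", now=10))   # both healthy
reg.heartbeat("orders", "10.0.0.1:8080", now=25)
print(reg.lookup("orders", now=40))   # 10.0.0.2 missed heartbeats: expired
```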
Common Mistakes with Service Discovery & Registration
- Hardcoding service endpoints in config files. This breaks the moment instances change.
- Skipping graceful deregistration, so clients keep routing to terminated instances
- Ignoring DNS TTL caching. Stale DNS records send requests to dead instances.
- Not separating liveness from readiness. A service that is still booting should not receive traffic.
- Running a single registry instance, which turns the registry itself into a single point of failure
Tools for Service Discovery & Registration
- Kubernetes Services (Open Source): Built-in K8s discovery, ClusterIP/NodePort/LoadBalancer — Scale: Medium-Enterprise
- Consul (Open Source): Multi-platform, health checks, KV config — Scale: Medium-Enterprise
- Eureka (Open Source): Spring Cloud ecosystem, AP semantics — Scale: Medium-Large
- Nacos (Open Source): Service discovery + config management, popular in Java — Scale: Medium-Large
Related to Service Discovery & Registration
DNS & Service Discovery, Kubernetes Architecture, Service Mesh, Load Balancer
Service Mesh — Networking & Traffic
Difficulty: Advanced
Key Points for Service Mesh
- A dedicated infrastructure layer that handles service-to-service communication through sidecar proxies
- Handles mTLS, retries, circuit breaking, and observability without touching application code
- Data plane (Envoy sidecars) moves traffic; control plane (Istiod) manages configuration
- Adds roughly 2-5ms latency per hop because of the sidecar proxy. Know the trade-off before committing.
- Starts paying for itself around 50+ services. Adopting it too early just adds complexity for no real gain.
Common Mistakes with Service Mesh
- Adopting a service mesh with fewer than 20 services. The operational overhead just isn't worth it at that scale.
- Forgetting the per-pod resource cost. Each Envoy sidecar eats 50-100MB RAM, and that adds up fast.
- Thinking the mesh handles all security. It provides transport encryption, not application-level authorization.
- Leaving retry policies at defaults. Untuned retries will amplify failures into a retry storm.
- Treating the control plane as a fire-and-forget component. If Istiod goes down, new proxy configs stop flowing.
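The retry-storm mistake above has simple arithmetic behind it: when every hop in a call chain retries independently, worst-case request multiplication is exponential in chain depth. A back-of-envelope sketch:

```python
def retry_amplification(attempts_per_hop, depth):
    """Worst-case backend calls generated by one user request when every
    hop in a depth-deep chain makes attempts_per_hop attempts (1 original
    + retries) and everything downstream is failing."""
    return attempts_per_hop ** depth

# 3 attempts per hop across a 3-service-deep chain:
print(retry_amplification(attempts_per_hop=3, depth=3))   # 27
```

This is why retries should usually live at one layer (often the mesh or the outermost client) with budgets and jitter, not at every hop with defaults.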
Tools for Service Mesh
- Istio (Open Source): Full-featured mesh, Envoy-based, large community — Scale: Large-Enterprise
- Linkerd (Open Source): Lightweight, Rust proxy, simpler operations — Scale: Medium-Large
- Consul Connect (Open Source): Multi-platform (K8s + VMs), integrated service discovery — Scale: Medium-Enterprise
- AWS App Mesh (Managed): AWS-native, ECS/EKS integration — Scale: Medium-Enterprise
Related to Service Mesh
Istio, Linkerd, API Gateway, Load Balancer, Kubernetes Architecture, Distributed Tracing, Zero Trust & Network Security
WebSocket Gateway — Networking & Traffic
Difficulty: Advanced
Key Points for WebSocket Gateway
- Sits between the load balancer and sync servers, terminating the WebSocket protocol and managing connection lifecycle
- ALB terminates TLS and routes the initial HTTP Upgrade, but after the upgrade it becomes a byte-forwarding proxy. The gateway is where protocol intelligence lives
- Two patterns: proxy mode (forward bytes, route at connection time) vs termination mode (decode every frame, route per message). Most collaboration systems use proxy mode
- Maintains a connection registry mapping sockets to users, enabling targeted message delivery instead of broadcast
- Proxy mode scales at O(connections), termination mode at O(messages). Message volume is always much larger than connection volume
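The connection registry described above is conceptually just a user-to-gateway mapping. A minimal in-memory sketch; production would keep this mapping in a shared store like Redis so every gateway sees it, and the names here are hypothetical:

```python
class ConnectionRegistry:
    """Map user -> {(gateway, socket_id)} so a message for one user is
    sent only to the gateway holding that connection, not broadcast."""
    def __init__(self):
        self.by_user = {}

    def connect(self, user, gateway, socket_id):
        self.by_user.setdefault(user, set()).add((gateway, socket_id))

    def disconnect(self, user, gateway, socket_id):
        conns = self.by_user.get(user, set())
        conns.discard((gateway, socket_id))
        if not conns:
            self.by_user.pop(user, None)

    def route(self, user):
        """Gateways that must receive a message for this user (often one)."""
        return {gw for gw, _ in self.by_user.get(user, set())}

reg = ConnectionRegistry()
reg.connect("alice", "gw-1", "sock-9")
reg.connect("alice", "gw-2", "sock-3")   # second device on another gateway
print(sorted(reg.route("alice")))        # ['gw-1', 'gw-2']: targeted delivery
print(reg.route("bob"))                  # set(): nothing to send, no broadcast
```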
Common Mistakes with WebSocket Gateway
- Confusing the ALB with the gateway. After the WebSocket upgrade, ALB is a TCP pipe. It cannot inspect frames, enforce per-message auth, or route based on payload content.
- Not separating gateway from sync servers. Combining them works at small scale, but connection management cannot scale independently from application logic.
- Skipping connection draining on gateway deploys. Killing a gateway pod drops all its WebSocket connections and triggers a thundering herd reconnect.
- Setting ALB idle timeout too low. ALB closes connections idle beyond its timeout (default 60s). Set it to 3600s for WebSocket and rely on application-level ping/pong instead.
- Building message routing without a connection registry. Without knowing which gateway holds which user, the system falls back to broadcasting every message to every gateway at O(N) cost.
Tools for WebSocket Gateway
- NGINX (Open Source): WebSocket proxying with upstream routing, proven at scale — Scale: Medium-Enterprise
- Envoy (Open Source): Dynamic routing, gRPC + WebSocket, observability built in — Scale: Large-Enterprise
- AWS API Gateway WebSocket (Managed): Serverless WebSocket, Lambda per-message dispatch, zero ops — Scale: Small-Large
- Hocuspocus (Open Source): Yjs/CRDT-aware WebSocket server, collaborative editing — Scale: Small-Medium
Related to WebSocket Gateway
WebSocket & Real-Time Communication, Load Balancer, API Gateway, Auto-Scaling Patterns
WebSocket & Real-Time Communication — Networking & Traffic
Difficulty: Advanced
Key Points for WebSocket & Real-Time Communication
- WebSocket provides full-duplex, persistent connections over a single TCP socket, killing HTTP polling overhead entirely
- Connection lifecycle: HTTP Upgrade handshake, bidirectional frames, ping/pong keepalive, close handshake
- SSE (Server-Sent Events) is the simpler option for server-to-client push. It works with HTTP/2 and auto-reconnects out of the box
- A connection registry (Redis) solves the routing problem: an event can land on any server, but the connection lives on one specific pod
- Graceful connection draining during deploys prevents mass reconnect thundering herds
Common Mistakes with WebSocket & Real-Time Communication
- Using L7 HTTP load balancers that buffer requests. WebSocket needs L4 (TCP) pass-through or L7 with explicit upgrade support.
- Broadcasting events to all pods instead of routing to the right one. That is O(N) fan-out when O(1) targeted delivery is possible.
- Skipping ping/pong keepalive. Silent TCP connection drops go undetected and stale connections pile up.
- Deploying without graceful drain. A rolling restart drops all connections at once and triggers a thundering herd reconnect storm.
- Storing connection state only in local memory. When a pod dies, all connection metadata is gone with no recovery path.
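The ping/pong mistake above comes down to tracking the last pong per connection and reaping anything that misses the deadline. A minimal sketch; timeouts and IDs are illustrative:

```python
class KeepaliveTracker:
    """Detect silent TCP drops: a dead peer never sends a close frame,
    it just stops answering pings."""
    def __init__(self, timeout=30):
        self.timeout = timeout
        self.last_pong = {}       # conn_id -> time of last pong

    def pong(self, conn_id, now):
        self.last_pong[conn_id] = now

    def reap_stale(self, now):
        """Return and forget connections past the pong deadline;
        a real server would also close their sockets here."""
        stale = [c for c, t in self.last_pong.items()
                 if now - t > self.timeout]
        for conn_id in stale:
            del self.last_pong[conn_id]
        return stale

tracker = KeepaliveTracker(timeout=30)
tracker.pong("conn-a", now=0)
tracker.pong("conn-b", now=0)
tracker.pong("conn-a", now=20)       # conn-a still responding
print(tracker.reap_stale(now=40))    # ['conn-b']: silently dropped
```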
Tools for WebSocket & Real-Time Communication
- Socket.IO (Open Source): Auto-reconnect, fallback transports, rooms/namespaces — Scale: Small-Medium
- ws (Node.js) (Open Source): Minimal WebSocket server, high performance, no abstraction — Scale: Medium-Enterprise
- Envoy/NGINX (Open Source): L4/L7 proxy with WebSocket support, connection draining — Scale: Medium-Enterprise
- AWS API Gateway WebSocket (Managed): Serverless WebSocket with Lambda integration, connection management — Scale: Small-Large
Related to WebSocket & Real-Time Communication
WebSocket Gateway, Load Balancer, API Gateway, Service Mesh, Auto-Scaling Patterns
XDP (eXpress Data Path) — Networking & Traffic
Difficulty: Advanced
Key Points for XDP (eXpress Data Path)
- Processes packets at the NIC driver level before the kernel networking stack even sees them
- Runs as an eBPF program, using the same toolchain and deployment model teams already know
- No dedicated CPU cores needed. Lightweight enough to run on every production server
- 5-10M packets/sec per core vs roughly 1M with normal send() syscalls
- Used by Cloudflare for DDoS mitigation, Facebook for load balancing, and Cilium for Kubernetes networking
Common Mistakes with XDP (eXpress Data Path)
- Assuming XDP bypasses the kernel entirely. It does not. It hooks into the NIC driver inside the kernel, just before the networking stack. Full kernel bypass is DPDK territory.
- Writing complex stateful logic in XDP programs. eBPF programs have size limits and restricted loops. Keep XDP programs simple: filter, forward, or redirect. Do complex processing in userspace.
- Not checking NIC driver support. XDP works best in native mode (driver support). Generic mode (no driver support) is much slower and defeats the purpose.
- Forgetting that XDP runs per-packet. At 10M packets/sec, even a small per-packet overhead adds up fast. Profile XDP programs.
- Deploying without a fallback. If the XDP program crashes or has a bug, packets get dropped. Always have a health check that detaches the program if it misbehaves.
Tools for XDP (eXpress Data Path)
- XDP + eBPF (Open Source): Packet filtering, forwarding, and sampling at NIC driver level — Scale: Medium-Enterprise
- tc/BPF (Open Source): Traffic shaping and classification after the kernel stack — Scale: Small-Enterprise
- iptables / nftables (Open Source): Traditional firewall rules, simpler setups — Scale: Small-Medium
- AF_XDP (Open Source): Zero-copy packet delivery from NIC to userspace applications — Scale: Medium-Enterprise
Related to XDP (eXpress Data Path)
DPDK (Data Plane Development Kit), Load Balancer, Service Mesh
Zero Trust & Network Security — Security & Governance
Difficulty: Advanced
Key Points for Zero Trust & Network Security
- Every request gets authenticated and authorized regardless of where it originates. Being 'inside' the network means nothing.
- Micro-segmentation replaces the old network perimeter. Each service-to-service call is policy-controlled.
- mTLS provides both encryption and identity verification between services
- Kubernetes network policies restrict pod-to-pod communication at the CNI level
- Defense in depth matters. Combine network policies, service mesh mTLS, and application-level authz together.
Common Mistakes with Zero Trust & Network Security
- Relying only on network perimeter security. Once inside, attackers move laterally with zero resistance.
- Deploying mTLS without certificate rotation. Expired certs cause outages, and they always expire at the worst time.
- Leaving network policies wide open. An 'allow all' default defeats the entire purpose.
- Skipping audit trails on policy changes. Security policies need the same review process as application code.
- Ignoring east-west traffic. Most real attacks exploit service-to-service communication, not north-south.
Tools for Zero Trust & Network Security
- Istio (Open Source): mTLS, authorization policies, service identity (SPIFFE) — Scale: Large-Enterprise
- Calico (Open Source): K8s network policies, eBPF dataplane — Scale: Medium-Enterprise
- Open Policy Agent (Open Source): Policy-as-code, admission control, authz decisions — Scale: Medium-Enterprise
- Cilium (Open Source): eBPF networking, L7 visibility, network policies — Scale: Medium-Enterprise
Related to Zero Trust & Network Security
Service Mesh, Secrets Management, API Gateway, Compliance & Audit Logging