Alerting & On-Call — Observability & Reliability
Difficulty: Intermediate
Key Points for Alerting & On-Call
- Alerts should be actionable — every alert that fires should require human intervention
- Alert on SLO burn rate, not raw metrics — 'error budget consumed 10% in 1 hour' is more useful than 'error rate > 1%'
- Escalation policies ensure alerts reach the right person — primary → secondary → manager → incident commander
- Runbooks document response procedures for each alert — reduce MTTR by giving responders a playbook
- On-call rotations should be sustainable — no more than 2 pages per shift, or the system needs fixing
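The burn-rate idea above can be sketched in a few lines. This is an illustrative in-process sketch assuming a 99.9% availability SLO; the 14.4 threshold follows the common multi-window pattern and is not prescriptive:

```python
# Sketch of SLO burn-rate alerting, assuming a 99.9% availability SLO.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may fail

def burn_rate(errors: int, requests: int) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both windows burn fast (multi-window alerting):
    the long window shows real impact, the short one shows it is still
    happening, so a resolved blip does not wake anyone up."""
    return short_window_rate >= threshold and long_window_rate >= threshold
```

A 2% error rate burns a 0.1% budget 20x faster than sustainable — that ratio, not the raw percentage, is what the page fires on.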
Common Mistakes with Alerting & On-Call
- Alert fatigue — too many non-actionable alerts cause responders to ignore real incidents
- No alert deduplication or grouping — 100 alerts for the same incident overwhelm the on-call
- Missing runbooks — alert fires at 3 AM and the on-call has no idea what to do
- Not tracking alert-to-incident ratio — high ratio means too many false positives
- Alerting only on error rate without considering traffic volume — 1 error in 2 requests is a 50% error rate but not an incident; require a minimum request count before the alert fires
Tools for Alerting & On-Call
- PagerDuty (Commercial): Incident management, escalation policies, integrations — Scale: Medium-Enterprise
- OpsGenie (Commercial): Atlassian integration, alert routing, on-call scheduling — Scale: Medium-Enterprise
- Prometheus Alertmanager (Open Source): Prometheus-native, grouping, silencing, inhibition — Scale: Medium-Enterprise
- Grafana OnCall (Open Source): Grafana-native, ChatOps, escalation chains — Scale: Small-Large
Related to Alerting & On-Call
Metrics & Monitoring, Distributed Logging, Distributed Tracing
API Gateway — Networking & Traffic
Difficulty: Intermediate
Key Points for API Gateway
- Single entry point for all client requests — centralizes cross-cutting concerns
- Handles authentication, rate limiting, routing, and protocol translation
- Can aggregate multiple microservice calls into a single client response
- Critical path component — must be highly available and low-latency
- Decouples client interface from internal service topology
Common Mistakes with API Gateway
- Single point of failure without redundancy — always deploy at least two instances behind a load balancer
- Putting business logic in the gateway layer — keep it thin, route and validate only
- Not implementing circuit breakers for downstream service failures
- Ignoring tail latency — the gateway sits on the critical path, so its P99 overhead compounds with backend latency on every request
- Not versioning request/response transformations — a gateway deploy silently breaks existing clients
Tools for API Gateway
- Kong (Open Source): Plugin ecosystem, Lua extensibility — Scale: Medium-Enterprise
- AWS API Gateway (Managed): Serverless, Lambda integration — Scale: Small-Enterprise
- Envoy (Open Source): Service mesh sidecar, gRPC-native — Scale: Large-Enterprise
- NGINX (Open Source): High-performance reverse proxy — Scale: Small-Enterprise
Related to API Gateway
Load Balancer, Service Mesh, Rate Limiting & Throttling, DNS & Service Discovery
Artifact Management & Container Registry — CI/CD & Deployment
Difficulty: Intermediate
Key Points for Artifact Management & Container Registry
- Central repository for build artifacts (container images, packages, binaries) with versioning and access control
- Container registries store OCI images with layer deduplication — only changed layers are pushed/pulled
- Image scanning detects CVEs in base images and dependencies before deployment
- Immutable references (pin SHA256 digests, not mutable tags like :latest) ensure reproducible deployments
- Proximity matters — registry in the same region as your cluster reduces pull times significantly
Common Mistakes with Artifact Management & Container Registry
- Using :latest tag in production — you cannot deterministically reproduce a deployment
- Not scanning images before deployment — known CVEs make it to production
- Storing secrets in image layers — they persist in layer history forever
- No image retention policy — registry storage grows unbounded, costs increase
- Pulling from public registries in CI — rate limits and outages break your pipeline
Tools for Artifact Management & Container Registry
- AWS ECR (Managed): EKS integration, lifecycle policies, scanning — Scale: Small-Enterprise
- Harbor (Open Source): Enterprise registry, replication, RBAC, scanning — Scale: Medium-Enterprise
- Docker Hub (Managed): Public images, community ecosystem — Scale: Small-Medium
- GitHub Packages (Managed): GitHub-native, multi-format (npm, Docker, Maven) — Scale: Small-Large
Related to Artifact Management & Container Registry
CI/CD Pipeline Design, Container Runtime & Docker, Secrets Management, Kubernetes Architecture
Auto-Scaling Patterns — Compute & Orchestration
Difficulty: Advanced
Key Points for Auto-Scaling Patterns
- Automatically adjusts compute capacity based on demand — scales out on load, scales in on idle
- Reactive scaling (metric thresholds) vs predictive scaling (ML-based forecasting) vs scheduled scaling
- Horizontal Pod Autoscaler (HPA) scales pods; Cluster Autoscaler scales nodes; Karpenter replaces the Cluster Autoscaler with faster, more flexible node provisioning
- Scale-out is fast; scale-in must be conservative — premature scale-in causes oscillation
- Custom metrics (queue depth, business KPIs) are often better scaling signals than CPU/memory
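Reactive scaling can be illustrated with the HPA's core proportional formula, desired = ceil(current × metric/target). The 10% tolerance band mirrors the HPA default; the min/max clamps here are illustrative:

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """HPA-style proportional scaling: scale by the ratio of observed
    metric to target, skipping changes inside the tolerance band to
    avoid thrashing."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current                      # close enough: do nothing
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(desired, max_replicas))
```

For example, 4 replicas at 90% CPU against a 60% target yields 6; at 63% (inside the tolerance band) nothing changes.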
Common Mistakes with Auto-Scaling Patterns
- Scaling on CPU only — high CPU doesn't always mean the service needs more instances
- Setting scale-in cooldown too short — causes thrashing (scale out, scale in, scale out)
- Not accounting for pod startup time in scaling decisions — new pods aren't ready for 30-60s
- Ignoring cluster autoscaler lag — node provisioning takes 2-5 minutes on most cloud providers
- Not using PodDisruptionBudgets during scale-in — terminating too many pods disrupts availability
Tools for Auto-Scaling Patterns
- Kubernetes HPA (Open Source): Pod-level scaling, custom metrics API — Scale: Medium-Enterprise
- Karpenter (Open Source): Fast node provisioning, instance type selection — Scale: Medium-Enterprise
- AWS Auto Scaling (Managed): EC2/ECS scaling, target tracking policies — Scale: Small-Enterprise
- KEDA (Open Source): Event-driven scaling, scale-to-zero — Scale: Medium-Enterprise
Related to Auto-Scaling Patterns
Kubernetes Architecture, Metrics & Monitoring, Load Balancer, Serverless & FaaS
Caching Strategies — Data & Storage
Difficulty: Intermediate
Key Points for Caching Strategies
- Reduces database load and latency by serving frequently accessed data from fast in-memory stores
- Cache-aside, read-through, write-through, write-behind — each pattern suits different access patterns
- Cache invalidation is famously hard — TTL, event-driven invalidation, and versioned keys are the main approaches
- Cache stampede (thundering herd) occurs when many requests simultaneously miss cache — use locking or stale-while-revalidate
- Multi-tier caching (L1 in-process, L2 distributed) balances latency and consistency
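A minimal cache-aside read with per-key locking against stampedes might look like this. It is an in-process sketch; a distributed cache would use a Redis lock or stale-while-revalidate instead:

```python
import threading
import time

_cache = {}     # key -> (value, expires_at)
_locks = {}     # key -> per-key refill lock
_locks_guard = threading.Lock()

def get_with_cache(key, loader, ttl=60.0):
    """Cache-aside read; on a miss, only one thread runs the loader
    while the rest wait and then reuse its result."""
    entry = _cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                      # fresh hit
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        entry = _cache.get(key)              # re-check after waiting
        if entry and entry[1] > time.monotonic():
            return entry[0]
        value = loader()                     # e.g. the database query
        _cache[key] = (value, time.monotonic() + ttl)
        return value
```

The double-check inside the lock is the stampede protection: a waiting thread finds the value already refilled and never hits the database.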
Common Mistakes with Caching Strategies
- Caching without setting TTL — stale data persists indefinitely
- Using cache as primary data store — cache eviction causes data loss
- Not handling cache failures gracefully — if Redis is down, app should fall back to database
- Cache key collisions — poorly designed keys cause different data to overwrite each other
- Ignoring cache warming on deploy — cold cache after restart causes database spike
Tools for Caching Strategies
- Redis (Open Source): Rich data structures, pub/sub, persistence options — Scale: Small-Enterprise
- Memcached (Open Source): Simple key-value, multi-threaded, maximum throughput — Scale: Medium-Enterprise
- Caffeine (Open Source): JVM in-process cache, near-optimal hit ratio — Scale: Small-Large
- Hazelcast (Open Source): Distributed cache, embedded or client-server — Scale: Medium-Enterprise
Related to Caching Strategies
CDN & Edge Computing, Database Sharding, Rate Limiting & Throttling
CDN & Edge Computing — Networking & Traffic
Difficulty: Intermediate
Key Points for CDN & Edge Computing
- Caches content at geographically distributed edge nodes close to users
- Reduces origin server load and improves latency by 50-90% for static assets
- Edge computing extends beyond caching — run logic at the edge with Workers/Lambda@Edge
- Cache invalidation is the hardest problem — TTL, purge APIs, stale-while-revalidate
- Shield/origin-shield pattern reduces thundering herd on cache misses
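Getting cache keys right is mostly about choosing which headers to vary on. A sketch of a key builder — the header list is illustrative:

```python
def edge_cache_key(url: str, request_headers: dict,
                   vary=("Accept-Encoding", "Accept-Language")) -> str:
    """Build an edge cache key that varies only on content-negotiation
    headers: vary on none and gzip bodies reach clients that cannot
    decode them; vary on every header and the hit ratio collapses."""
    parts = [url] + [f"{h}={request_headers.get(h, '')}" for h in vary]
    return "|".join(parts)
```

Two clients requesting the same URL with different Accept-Encoding values get distinct cache entries; everything else shares one.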
Common Mistakes with CDN & Edge Computing
- Setting overly long TTLs without a purge strategy — stale content persists globally
- Caching responses with Set-Cookie headers — serving one user's session to another
- Not varying cache keys on relevant headers (Accept-Encoding, Accept-Language)
- Ignoring cache hit ratio metrics — low hit ratios mean the CDN is just adding latency
- Not using origin shield — N edge PoPs each independently fetching from origin on miss
Tools for CDN & Edge Computing
- CloudFront (Managed): AWS ecosystem, Lambda@Edge — Scale: Small-Enterprise
- Cloudflare (Managed): Edge Workers, DDoS protection, massive PoP network — Scale: Small-Enterprise
- Fastly (Managed): Instant purge, VCL customization, real-time logging — Scale: Medium-Enterprise
- Akamai (Commercial): Largest network, media delivery, enterprise SLAs — Scale: Enterprise
Related to CDN & Edge Computing
Load Balancer, Caching Strategies, DNS & Service Discovery
CI/CD Pipeline Design — CI/CD & Deployment
Difficulty: Intermediate
Key Points for CI/CD Pipeline Design
- Continuous Integration merges code frequently; Continuous Delivery automates release; Continuous Deployment auto-deploys
- Pipeline stages: lint → test → build → security scan → deploy to staging → integration test → deploy to prod
- Fast feedback loops are critical — aim for <10 min from commit to test results
- Trunk-based development with short-lived feature branches minimizes merge conflicts
- Pipeline as code (Jenkinsfile, .github/workflows) ensures reproducibility and auditability
Common Mistakes with CI/CD Pipeline Design
- Not parallelizing independent test suites — sequential execution wastes minutes per build
- Running all tests on every PR — use affected/changed file detection for large monorepos
- Manual deployment steps in the pipeline — defeats the purpose of automation
- Not caching dependencies between builds — npm install from scratch adds 2-5 minutes
- Sharing mutable state between pipeline stages — flaky tests from leftover state
Tools for CI/CD Pipeline Design
- GitHub Actions (Managed): GitHub-native, marketplace actions, matrix builds — Scale: Small-Enterprise
- GitLab CI (Open Source): Integrated DevOps platform, self-hosted option — Scale: Medium-Enterprise
- Jenkins (Open Source): Maximum flexibility, plugin ecosystem, self-hosted — Scale: Medium-Enterprise
- CircleCI (Managed): Docker-first, fast builds, orbs ecosystem — Scale: Small-Large
Related to CI/CD Pipeline Design
Deployment Strategies, GitOps & Infrastructure as Code, Artifact Management & Container Registry, Container Runtime & Docker
Compliance & Audit Logging — Security & Governance
Difficulty: Advanced
Key Points for Compliance & Audit Logging
- Audit logs record who did what, when, and from where — the forensic trail for security incidents
- Immutable, append-only storage ensures audit logs cannot be tampered with
- Compliance frameworks (SOC 2, GDPR, HIPAA, PCI-DSS) mandate specific logging and retention requirements
- Separation of duties — engineers who deploy code should not be able to modify audit logs
- Automated compliance checks in CI/CD catch policy violations before they reach production
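Tamper-evidence for an append-only log is often implemented as a hash chain, where each record hashes its predecessor. A toy sketch — real systems also anchor the chain in write-once storage:

```python
import hashlib
import json

def append_event(log: list, event: dict) -> None:
    """Append an audit record whose hash covers the previous record,
    so any later edit breaks every hash after it."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {"event": event, "prev": prev_hash}
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)

def verify_chain(log: list) -> bool:
    """Recompute every hash; False means the log was tampered with."""
    prev = "0" * 64
    for record in log:
        payload = json.dumps({"event": record["event"], "prev": record["prev"]},
                             sort_keys=True).encode()
        if record["prev"] != prev or \
           record["hash"] != hashlib.sha256(payload).hexdigest():
            return False
        prev = record["hash"]
    return True
```

Editing any historical record invalidates the chain from that point on, which is exactly the property auditors want.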
Common Mistakes with Compliance & Audit Logging
- Audit logs stored in the same system they audit — compromised system can delete its own audit trail
- Not logging failed authentication attempts — these are often the first sign of an attack
- Insufficient log retention — compliance requires 1-7 years depending on framework
- No automated alerting on suspicious audit events — logs are only useful if someone reads them
- Logging too much PII in audit records — audit logs themselves become a compliance liability
Tools for Compliance & Audit Logging
- AWS CloudTrail (Managed): AWS API audit logging, S3 integration — Scale: Small-Enterprise
- Falco (Open Source): Runtime security, K8s audit, eBPF-based — Scale: Medium-Enterprise
- Splunk (Commercial): Enterprise SIEM, compliance reporting, SPL queries — Scale: Large-Enterprise
- Open Policy Agent (Open Source): Policy enforcement, admission webhooks, audit — Scale: Medium-Enterprise
Related to Compliance & Audit Logging
Secrets Management, Zero Trust & Network Security, Distributed Logging, Object Storage & Data Lake
Container Runtime & Docker — Compute & Orchestration
Difficulty: Intermediate
Key Points for Container Runtime & Docker
- Containers provide process-level isolation using Linux namespaces and cgroups — not VMs
- OCI standard defines image format and runtime spec — Docker is one implementation among many
- containerd and CRI-O are the dominant runtimes in Kubernetes — Docker (dockershim) was removed in K8s 1.24
- Image layers are copy-on-write — shared base layers save disk and speed up pulls
- Multi-stage builds reduce image size by separating build dependencies from runtime
Common Mistakes with Container Runtime & Docker
- Running containers as root — compromised container gets host-level access
- Using large base images (ubuntu:latest is 77MB) when distroless or alpine suffices (5MB)
- Storing secrets in image layers — they persist in layer history even if deleted in later layers
- Not pinning base image versions — FROM python:3 pulls different images over time
- Ignoring .dockerignore — build context includes unnecessary files, slowing builds
Tools for Container Runtime & Docker
- containerd (Open Source): Kubernetes default runtime, CNCF graduated — Scale: Medium-Enterprise
- Docker Engine (Open Source): Developer experience, Docker Compose, build tooling — Scale: Small-Large
- CRI-O (Open Source): Minimal K8s-focused runtime, OpenShift default — Scale: Medium-Enterprise
- Podman (Open Source): Rootless containers, daemonless, Docker CLI-compatible — Scale: Small-Medium
Related to Container Runtime & Docker
Kubernetes Architecture, CI/CD Pipeline Design, Artifact Management & Container Registry, Secrets Management
Database Sharding — Data & Storage
Difficulty: Advanced
Key Points for Database Sharding
- Horizontally partitions data across multiple database instances by a shard key
- Shard key selection is the most critical decision — wrong key causes hotspots and cross-shard queries
- Range-based vs hash-based vs directory-based sharding — each has distinct trade-offs
- Resharding (adding/removing shards) is operationally dangerous — plan capacity ahead
- Cross-shard transactions require two-phase commit or saga patterns — avoid when possible
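Hash-based routing is the simplest of the three schemes. A sketch — production systems prefer consistent hashing precisely because naive modulo remaps most keys when the shard count changes:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard by hashing: distributes uniformly for
    high-cardinality keys, but modulo arithmetic means resharding
    moves almost every key."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

The same key always lands on the same shard, and a high-cardinality key like a user ID spreads load evenly — the low-cardinality mistake above would collapse this distribution onto a handful of shards.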
Common Mistakes with Database Sharding
- Choosing a shard key with low cardinality — all data ends up on one shard
- Not planning for resharding — data growth makes initial shard count insufficient
- Designing queries that require cross-shard joins — defeats the purpose of sharding
- Using auto-increment IDs as shard keys — creates write hotspots on the latest shard
- Not testing shard failure scenarios — losing one shard should not bring down the entire system
Tools for Database Sharding
- Vitess (Open Source): MySQL sharding, used by YouTube/Slack — Scale: Large-Enterprise
- CockroachDB (Open Source): Auto-sharding, distributed SQL, strong consistency — Scale: Medium-Enterprise
- Citus (PostgreSQL) (Open Source): PostgreSQL extension, transparent sharding — Scale: Medium-Enterprise
- MongoDB (Open Source): Native sharding with config servers and mongos — Scale: Medium-Enterprise
Related to Database Sharding
Replication & Consistency, Caching Strategies, Load Balancer
Deployment Strategies — CI/CD & Deployment
Difficulty: Intermediate
Key Points for Deployment Strategies
- Blue-green deploys switch traffic between two identical environments — instant rollback
- Canary releases gradually shift traffic (1% → 5% → 25% → 100%) to detect issues early
- Rolling updates replace instances one-by-one — Kubernetes default strategy
- Feature flags decouple deployment from release — deploy dark, enable for specific users
- Database migrations must be backward-compatible — the old code runs alongside new code during rollout
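Canary promotion reduces to a small decision function run after each analysis interval; the stages and regression threshold here are illustrative:

```python
STAGES = (1, 5, 25, 100)   # percent of traffic on the new version

def next_action(current_pct: int, canary_error_rate: float,
                baseline_error_rate: float,
                max_regression: float = 0.01) -> str:
    """Automated canary analysis: roll back on regression versus the
    stable baseline, otherwise promote to the next traffic stage
    until fully rolled out."""
    if canary_error_rate > baseline_error_rate + max_regression:
        return "rollback"
    if current_pct >= STAGES[-1]:
        return "done"
    return "promote"
```

Comparing against a live baseline rather than a fixed threshold matters: if the whole system is degraded, the canary should not be blamed for it.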
Common Mistakes with Deployment Strategies
- Not testing rollback procedures — rollback fails when you need it most
- Running incompatible database migrations — breaking old code during rolling deploy
- Canary without automated analysis — humans can't watch dashboards 24/7
- Blue-green without sufficient capacity — need 2x infrastructure during deployment
- Ignoring deployment velocity — infrequent large deploys are riskier than frequent small ones
Tools for Deployment Strategies
- Argo Rollouts (Open Source): K8s-native canary/blue-green with analysis — Scale: Medium-Enterprise
- Flagger (Open Source): Service mesh integration, automated canary — Scale: Medium-Enterprise
- Spinnaker (Open Source): Multi-cloud deployment pipelines — Scale: Large-Enterprise
- AWS CodeDeploy (Managed): EC2/ECS/Lambda deployments, traffic shifting — Scale: Small-Enterprise
Related to Deployment Strategies
CI/CD Pipeline Design, Load Balancer, Metrics & Monitoring, Auto-Scaling Patterns
Distributed Logging — Observability & Reliability
Difficulty: Intermediate
Key Points for Distributed Logging
- Centralized logging aggregates logs from all services into a single searchable system
- Structured logging (JSON) enables querying and filtering — unstructured text logs are nearly useless at scale
- ELK/EFK stack (Elasticsearch, Fluentd/Logstash, Kibana) is the classic open-source solution
- Log levels (DEBUG, INFO, WARN, ERROR) should be dynamically adjustable without redeploy
- Correlation IDs (trace IDs) tie logs across services for a single request — essential for debugging
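Structured, correlation-ID-carrying log lines need nothing beyond the standard library; the field names here are illustrative:

```python
import json
import sys
from contextvars import ContextVar

# Set once per request (e.g. in middleware) from the incoming trace header.
trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

def log_json(level: str, message: str, **fields) -> str:
    """Emit one JSON log line stamped with the current request's trace ID,
    so logs from every service in the call chain join on that field."""
    record = {"level": level, "msg": message,
              "trace_id": trace_id.get(), **fields}
    line = json.dumps(record, sort_keys=True)
    print(line, file=sys.stderr)
    return line
```

Because every field is a JSON key, the aggregator can filter on `trace_id` or `amount_cents` directly instead of regexing free text.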
Common Mistakes with Distributed Logging
- Logging sensitive data (passwords, tokens, PII) — violates compliance and creates security risks
- Unstructured log messages — grep-based debugging doesn't scale past 10 services
- Not setting log retention policies — storage costs grow unbounded
- Logging too much at INFO level — volume overwhelms the logging pipeline
- Not buffering logs — direct writes to Elasticsearch from every pod creates write amplification
Tools for Distributed Logging
- Grafana Loki (Open Source): Log aggregation without full-text indexing, cost-efficient — Scale: Medium-Enterprise
- Elasticsearch + Kibana (Open Source): Full-text search, complex queries, mature ecosystem — Scale: Medium-Enterprise
- Datadog Logs (Commercial): Unified with metrics/traces, live tail, patterns — Scale: Small-Enterprise
- Fluentd/Fluent Bit (Open Source): Log collection and routing, CNCF graduated — Scale: Medium-Enterprise
Related to Distributed Logging
Metrics & Monitoring, Distributed Tracing, Compliance & Audit Logging, Object Storage & Data Lake
Distributed Tracing — Observability & Reliability
Difficulty: Advanced
Key Points for Distributed Tracing
- Tracks a single request as it flows through multiple services — shows the complete call chain
- Traces consist of spans — each span represents one unit of work with timing and metadata
- OpenTelemetry is the standard for instrumentation — vendor-neutral, supports metrics/logs/traces
- Head-based vs tail-based sampling — tail-based captures errors/slow requests more effectively
- Trace context propagation via W3C Trace Context headers (traceparent, tracestate) is the standard
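The W3C `traceparent` header is four hyphen-separated hex fields (version, trace-id, parent-id, flags); parsing it is straightforward:

```python
import re

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Split a W3C traceparent header into version, trace-id, parent-id,
    and the sampled bit; return None if malformed."""
    m = _TRACEPARENT.match(header)
    if m is None:
        return None
    version, tid, pid, flags = m.groups()
    return {"version": version, "trace_id": tid, "parent_id": pid,
            "sampled": bool(int(flags, 16) & 0x01)}
```

A service propagates the context by re-emitting the header on outgoing calls with its own span ID as the new parent-id — including onto messages it enqueues, or the trace chain breaks.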
Common Mistakes with Distributed Tracing
- Tracing 100% of requests in production — storage and processing costs are prohibitive at scale
- Not propagating trace context through message queues — async calls break the trace chain
- Only instrumenting HTTP calls — database queries, cache lookups, and queue operations need spans too
- Ignoring sampling configuration — default head-based sampling misses rare but important errors
- Not correlating traces with logs and metrics — the three pillars should be linked by trace ID
Tools for Distributed Tracing
- Jaeger (Open Source): Distributed tracing, CNCF graduated, mature UI — Scale: Medium-Enterprise
- Grafana Tempo (Open Source): Object storage backend, cost-efficient, TraceQL — Scale: Medium-Enterprise
- OpenTelemetry (Open Source): Vendor-neutral instrumentation SDK and collector — Scale: Small-Enterprise
- Datadog APM (Commercial): Unified observability, service maps, error tracking — Scale: Small-Enterprise
Related to Distributed Tracing
Metrics & Monitoring, Distributed Logging, Service Mesh, Alerting & On-Call
DNS & Service Discovery — Networking & Traffic
Difficulty: Intermediate
Key Points for DNS & Service Discovery
- DNS translates human-readable names to IP addresses — the internet's phone book
- Service discovery enables microservices to find each other dynamically without hardcoded addresses
- Client-side vs server-side discovery — different trade-offs in complexity and load distribution
- TTL management is critical — too short increases DNS load, too long delays failover
- Health-aware DNS (Route 53, Consul) removes unhealthy endpoints from responses
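TTL-respecting caching on the resolver side looks roughly like this — the expiry check is the part applications most often skip:

```python
import time

class TtlCache:
    """Client-side DNS cache that honors record TTLs instead of
    caching the first resolved address forever."""
    def __init__(self):
        self._entries = {}  # name -> (addresses, expires_at)

    def put(self, name, addresses, ttl):
        self._entries[name] = (addresses, time.monotonic() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None or entry[1] <= time.monotonic():
            self._entries.pop(name, None)   # expired: force a re-resolution
            return None
        return entry[0]
```

A `None` return tells the caller to re-resolve — which is also the right response after a connection failure, since the old address may point at a dead or drained endpoint.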
Common Mistakes with DNS & Service Discovery
- Caching DNS results indefinitely in the application — ignoring TTL causes stale routing
- Not implementing client-side retry with re-resolution after connection failures
- Using DNS for load balancing without understanding that clients cache the first resolved IP
- Running service registry as a single instance — it becomes the single point of failure for all services
- Not monitoring DNS resolution latency — slow DNS adds hidden latency to every request
Tools for DNS & Service Discovery
- Consul (Open Source): Service mesh, health checks, KV store — Scale: Medium-Enterprise
- CoreDNS (Open Source): Kubernetes DNS, plugin-based — Scale: Medium-Enterprise
- AWS Route 53 (Managed): Global DNS, health checks, failover routing — Scale: Small-Enterprise
- etcd (Open Source): Kubernetes backing store, strong consistency — Scale: Medium-Large
Related to DNS & Service Discovery
API Gateway, Load Balancer, Service Mesh, Service Discovery & Registration
GitOps & Infrastructure as Code — CI/CD & Deployment
Difficulty: Advanced
Key Points for GitOps & Infrastructure as Code
- Git as the single source of truth for both application and infrastructure configuration
- Declarative infrastructure — define what you want, not how to get there
- Pull-based GitOps (ArgoCD, Flux) vs push-based IaC (Terraform apply in CI)
- Drift detection — continuous reconciliation ensures actual state matches desired state
- Infrastructure changes go through the same PR review process as application code
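Drift detection is, at its core, a diff between two state documents — one rendered from Git, one read back from the cluster or cloud API. A minimal sketch:

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Compare desired state (from Git) with actual state (from the
    cluster API) and return the fields a reconciler must correct."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"desired": want, "actual": have}
    return drift
```

A pull-based operator runs this continuously and applies the result, which is how a manual `kubectl scale` gets silently reverted back to what Git says.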
Common Mistakes with GitOps & Infrastructure as Code
- Manual kubectl apply in production — bypasses audit trail and review process
- Terraform state file without remote backend and locking — concurrent applies corrupt state
- Not using modules/reusable components — copy-paste infrastructure is unmaintainable
- Mixing application deployment with infrastructure provisioning in the same pipeline
- Not testing infrastructure changes — Terraform plan is your unit test, apply to staging is your integration test
Tools for GitOps & Infrastructure as Code
- ArgoCD (Open Source): K8s GitOps, multi-cluster, UI dashboard — Scale: Medium-Enterprise
- Terraform (Open Source): Multi-cloud IaC, stateful resource management — Scale: Small-Enterprise
- Pulumi (Open Source): IaC in real programming languages (TS, Python, Go) — Scale: Small-Enterprise
- Crossplane (Open Source): K8s-native cloud resource provisioning via CRDs — Scale: Medium-Enterprise
Related to GitOps & Infrastructure as Code
CI/CD Pipeline Design, Kubernetes Architecture, Secrets Management, Compliance & Audit Logging
Kubernetes Architecture — Compute & Orchestration
Difficulty: Advanced
Key Points for Kubernetes Architecture
- Container orchestration platform that automates deployment, scaling, and management of containerized applications
- Control plane (API server, etcd, scheduler, controller manager) manages desired state; worker nodes run workloads
- Declarative model — you define desired state in manifests, controllers reconcile actual state to match
- Service abstraction provides stable networking for ephemeral pods via kube-proxy and CoreDNS
- Resource requests/limits, HPA, and VPA enable right-sizing and auto-scaling of workloads
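The declarative model boils down to a reconciliation loop: observe actual state, diff against desired, act. A toy ReplicaSet-style pass (names illustrative):

```python
def reconcile(desired_replicas: int, pods: list) -> list:
    """One reconciliation pass of a toy ReplicaSet-style controller:
    emit create/delete actions until actual matches desired."""
    actions = []
    if len(pods) < desired_replicas:
        actions += [("create", f"pod-{i}")
                    for i in range(len(pods), desired_replicas)]
    elif len(pods) > desired_replicas:
        actions += [("delete", name) for name in pods[desired_replicas:]]
    return actions
```

Real controllers run this in a loop triggered by watch events, so a crashed pod is simply "actual < desired" on the next pass — there is no special crash-recovery code path.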
Common Mistakes with Kubernetes Architecture
- Not setting resource requests and limits — pods get OOMKilled or starve other workloads
- Using latest tag in production — non-deterministic deployments, impossible rollbacks
- Running workloads without PodDisruptionBudgets — cluster upgrades take down all replicas simultaneously
- Ignoring namespace resource quotas — one team's runaway deployment exhausts cluster resources
- Not configuring liveness/readiness probes — Kubernetes can't distinguish healthy from unhealthy pods
Tools for Kubernetes Architecture
- EKS (Managed): AWS-native, Fargate serverless option — Scale: Medium-Enterprise
- GKE (Managed): Autopilot mode, best K8s integration — Scale: Medium-Enterprise
- k3s (Open Source): Lightweight, edge deployments, IoT — Scale: Small-Medium
- OpenShift (Commercial): Enterprise security, integrated CI/CD — Scale: Large-Enterprise
Related to Kubernetes Architecture
Container Runtime & Docker, Service Mesh, Auto-Scaling Patterns, Service Discovery & Registration, GitOps & Infrastructure as Code
Load Balancer — Networking & Traffic
Difficulty: Intermediate
Key Points for Load Balancer
- Distributes incoming traffic across multiple backend servers to prevent overload
- L4 (transport) vs L7 (application) — different layers, different capabilities
- Enables horizontal scaling, fault tolerance, and zero-downtime deployments
- Health checks automatically remove unhealthy backends from the pool
- Session affinity (sticky sessions) vs stateless backends — critical architectural decision
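Health-aware round robin fits in a few lines; real balancers additionally weight by capacity and drain in-flight connections:

```python
import itertools

class RoundRobinPool:
    """Round-robin load balancer that skips backends a health check
    has marked unhealthy."""
    def __init__(self, backends):
        self._health = {b: True for b in backends}
        self._cycle = itertools.cycle(backends)

    def mark(self, backend, healthy: bool):
        self._health[backend] = healthy

    def pick(self):
        for _ in range(len(self._health)):
            backend = next(self._cycle)
            if self._health[backend]:
                return backend
        raise RuntimeError("no healthy backends")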
Common Mistakes with Load Balancer
- Using sticky sessions without understanding the failure mode — session data is lost when that backend dies
- Not configuring health check intervals correctly — too fast overwhelms backends, too slow delays failover
- Ignoring connection draining during deployments — in-flight requests get dropped
- Round robin across heterogeneous hardware — servers with different capacities get equal load
- Not accounting for keep-alive connection imbalance — long-lived connections skew distribution
Tools for Load Balancer
- AWS ALB/NLB (Managed): Cloud-native, auto-scaling integration — Scale: Small-Enterprise
- HAProxy (Open Source): High-performance L4/L7, battle-tested — Scale: Medium-Enterprise
- Envoy (Open Source): Service mesh, advanced observability — Scale: Large-Enterprise
- NGINX (Open Source): Web server + reverse proxy + LB — Scale: Small-Enterprise
Related to Load Balancer
API Gateway, CDN & Edge Computing, Auto-Scaling Patterns, DNS & Service Discovery
Message Queues & Event Streaming — Data & Storage
Difficulty: Intermediate
Key Points for Message Queues & Event Streaming
- Decouples producers from consumers — enables asynchronous processing and system resilience
- Message queues (RabbitMQ, SQS) deliver messages to one consumer; event streams (Kafka) allow multiple consumers
- At-least-once vs exactly-once vs at-most-once — delivery guarantees affect application design
- Consumer groups enable parallel processing while maintaining per-partition ordering
- Dead letter queues capture failed messages for debugging and reprocessing
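Consumer idempotency is usually achieved by deduplicating on a stable message ID. An in-memory sketch — in production the seen-set lives in Redis or the database, updated in the same transaction as the side effect:

```python
_processed = set()   # production: a durable store, not process memory

def handle_message(message: dict, side_effect) -> bool:
    """Idempotent consumer: under at-least-once delivery the same message
    can arrive twice, so run the side effect only on first sight and
    simply ack duplicates."""
    msg_id = message["id"]
    if msg_id in _processed:
        return False            # duplicate: ack without re-running the effect
    side_effect(message["body"])
    _processed.add(msg_id)
    return True
```

With this in place, a redelivery after a consumer crash charges the customer once, not twice — the guarantee the broker alone cannot give you.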
Common Mistakes with Message Queues & Event Streaming
- Not making consumers idempotent — at-least-once delivery means messages can be processed twice
- Unbounded queue growth — producers outpace consumers, queue fills disk, broker crashes
- Using a message queue when a simple function call suffices — unnecessary complexity
- Not monitoring consumer lag — falling behind indicates a scaling or performance problem
- Ordering assumptions across partitions — Kafka only guarantees order within a single partition
Tools for Message Queues & Event Streaming
- Apache Kafka (Open Source): Event streaming, high throughput, replay capability — Scale: Large-Enterprise
- RabbitMQ (Open Source): Traditional messaging, routing patterns, AMQP — Scale: Small-Large
- AWS SQS/SNS (Managed): Serverless, zero ops, fan-out patterns — Scale: Small-Enterprise
- Apache Pulsar (Open Source): Multi-tenancy, tiered storage, geo-replication — Scale: Large-Enterprise
Related to Message Queues & Event Streaming
Replication & Consistency, Caching Strategies, Distributed Logging, CI/CD Pipeline Design
Metrics & Monitoring — Observability & Reliability
Difficulty: Intermediate
Key Points for Metrics & Monitoring
- Three pillars of observability: metrics, logs, traces — metrics are the starting point for alerting
- USE method (Utilization, Saturation, Errors) for infrastructure; RED method (Rate, Errors, Duration) for services
- Prometheus pull-based model scrapes /metrics endpoints; push-based systems (Datadog, StatsD) receive from agents
- Cardinality explosion kills metric systems — avoid unbounded label values (user IDs, request IDs)
- SLIs, SLOs, and error budgets translate technical metrics into business reliability commitments
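The cardinality point can be made concrete with a counter that caps its distinct label sets; the overflow-series trick here is an illustrative defense, not a feature of any particular client library:

```python
class BoundedCounter:
    """Counter that caps the number of distinct label combinations,
    guarding against the cardinality explosion that unbounded labels
    (user IDs, request IDs) cause in a TSDB."""
    def __init__(self, max_series=1000):
        self._series = {}
        self._max = max_series

    def inc(self, **labels):
        key = tuple(sorted(labels.items()))
        if key not in self._series and len(self._series) >= self._max:
            key = (("overflow", "true"),)   # collapse excess series into one
        self._series[key] = self._series.get(key, 0) + 1
        return self._series[key]
```

Each unique label combination is one time series; capping them trades per-user detail for a metrics backend that stays up.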
Common Mistakes with Metrics & Monitoring
- Too many dashboards, not enough alerts — dashboards are for investigation, alerts are for detection
- Alerting on symptoms (high CPU) instead of user impact (elevated error rate, high latency)
- Not defining SLOs before building monitoring — you don't know what 'healthy' means
- High-cardinality labels causing Prometheus OOM — each unique label combination is a time series
- Not monitoring the monitoring system — Prometheus itself needs health checks and alerts
Tools for Metrics & Monitoring
- Prometheus (Open Source): Kubernetes-native, PromQL, pull-based — Scale: Medium-Enterprise
- Datadog (Commercial): Unified observability, APM, easy setup — Scale: Small-Enterprise
- Grafana + Mimir (Open Source): Long-term storage, multi-tenant Prometheus — Scale: Large-Enterprise
- Victoria Metrics (Open Source): High-performance TSDB, PromQL-compatible — Scale: Medium-Enterprise
Related to Metrics & Monitoring
Distributed Logging, Distributed Tracing, Alerting & On-Call, Auto-Scaling Patterns
Object Storage & Data Lake — Data & Storage
Difficulty: Intermediate
Key Points for Object Storage & Data Lake
- Stores unstructured data (files, images, logs, backups) as objects with metadata in flat namespaces
- Virtually unlimited scalability — S3 stores over 200 trillion objects
- 11 nines (99.999999999%) durability via erasure coding and redundant storage across multiple availability zones
- Data lakes layer structured query engines (Athena, Presto, Spark) over raw object storage
- Storage tiering (hot/warm/cold/archive) optimizes cost — lifecycle policies automate transitions
Common Mistakes with Object Storage & Data Lake
- Storing small objects individually — high request overhead per object, batch into larger files
- Not enabling versioning — accidental deletes or overwrites are irrecoverable
- Ignoring storage class optimization — keeping cold data in hot tier wastes 60-80% on storage costs
- Not using multipart upload for large files — a single PUT must restart from the beginning after any network failure, while multipart retries only the failed part
- Flat namespace without key prefix strategy — poor prefix design causes throttling (S3 partitions by prefix)
Tools for Object Storage & Data Lake
- AWS S3 (Managed): De facto standard, broadest ecosystem integration — Scale: Small-Enterprise
- MinIO (Open Source): S3-compatible on-premise, Kubernetes-native — Scale: Medium-Enterprise
- GCS (Managed): BigQuery integration, strong consistency — Scale: Small-Enterprise
- Azure Blob Storage (Managed): Azure ecosystem, ADLS Gen2 for analytics — Scale: Small-Enterprise
Related to Object Storage & Data Lake
Database Sharding, Distributed Logging, Compliance & Audit Logging, CI/CD Pipeline Design
Rate Limiting & Throttling — Networking & Traffic
Difficulty: Intermediate
Key Points for Rate Limiting & Throttling
- Protects services from abuse, DDoS, and resource exhaustion by capping request rates
- Token bucket, sliding window, and leaky bucket are the three core algorithms
- Distributed rate limiting requires shared state (Redis) — local-only limits are per-instance
- Rate limits should be communicated via standard headers (X-RateLimit-Limit, Remaining, Reset)
- Differentiate limits by tier — free users vs paid vs internal services
Common Mistakes with Rate Limiting & Throttling
- Per-instance rate limiting instead of global — clients can bypass by hitting different instances
- Not distinguishing between authenticated and unauthenticated rate limits
- Setting limits too tight during initial launch — legitimate traffic gets throttled
- Not returning Retry-After header — clients retry immediately, creating a thundering herd
- Rate limiting by IP only — shared IPs (NAT, corporate proxies) penalize all users behind them
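Of the three core algorithms, the token bucket is the most widely used because it allows controlled bursts. A single-process sketch (a distributed version would keep the same state in Redis, updated atomically via a Lua script):

```python
import time

class TokenBucket:
    """Single-process token bucket; tokens refill continuously at `rate`
    per second up to `capacity`, which bounds the permitted burst size."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]  # burst of 12 against capacity 10
```

On rejection, a real service should respond 429 with a Retry-After header, per the mistakes above, rather than leaving clients to retry immediately.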
Tools for Rate Limiting & Throttling
- Redis + Lua (Open Source): Distributed counters, atomic operations — Scale: Medium-Enterprise
- Envoy Rate Limit (Open Source): Service mesh integration, per-route limits — Scale: Large-Enterprise
- Kong Rate Limiting (Open Source): API gateway plugin, Redis-backed — Scale: Medium-Enterprise
- AWS WAF (Managed): Edge rate limiting, IP-based rules — Scale: Small-Enterprise
Related to Rate Limiting & Throttling
API Gateway, Load Balancer, Caching Strategies
Replication & Consistency — Data & Storage
Difficulty: Advanced
Key Points for Replication & Consistency
- Copies data across multiple nodes for fault tolerance and read scalability
- CAP theorem constrains choices — during a network partition, a system must choose between consistency and availability
- Synchronous replication guarantees consistency but increases write latency
- Eventual consistency is acceptable for most read-heavy workloads when designed correctly
- Consensus protocols (Raft, Paxos) enable leader election and consistent replicated state machines
Common Mistakes with Replication & Consistency
- Reading from async replicas for data that was just written — read-your-own-writes violation
- Assuming eventual consistency means 'eventually correct' — conflicts must be resolved explicitly
- Not monitoring replication lag — stale reads cause subtle application bugs
- Using synchronous replication across data centers — latency kills write throughput
- Not testing failover procedures — promotion of replica to primary should be rehearsed regularly
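Tunable consistency (as in Cassandra) comes down to a simple quorum rule: reads are guaranteed to see the latest acknowledged write iff every read quorum intersects every write quorum, i.e. R + W > N. A small sketch that checks the rule both ways:

```python
import itertools

def quorum_overlap(n: int, w: int, r: int) -> bool:
    """Read-after-write is guaranteed iff every read quorum
    intersects every write quorum: R + W > N."""
    return r + w > n

def read_after_write(n: int, w: int, r: int) -> bool:
    """Brute-force confirmation for small n: is there any pair of a
    write quorum and a read quorum that share no replica?"""
    replicas = range(n)
    for wq in itertools.combinations(replicas, w):
        for rq in itertools.combinations(replicas, r):
            if not set(wq) & set(rq):
                return False  # disjoint quorums -> a stale read is possible
    return True

# N=3: majority quorums (W=2, R=2) always overlap; W=1 with R=2 does not
```

This is why ONE/ONE gives fast but potentially stale reads, while QUORUM/QUORUM trades latency for read-your-writes.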
Tools for Replication & Consistency
- PostgreSQL (Open Source): Streaming replication, synchronous commit options — Scale: Small-Enterprise
- CockroachDB (Open Source): Raft-based automatic replication, strong consistency — Scale: Medium-Enterprise
- Cassandra (Open Source): Tunable consistency, multi-DC replication — Scale: Large-Enterprise
- TiDB (Open Source): MySQL-compatible, Raft-based, HTAP — Scale: Large-Enterprise
Related to Replication & Consistency
Database Sharding, Message Queues & Event Streaming, Kubernetes Architecture
Secrets Management — Security & Governance
Difficulty: Intermediate
Key Points for Secrets Management
- Centralized secret storage with encryption at rest and fine-grained access control
- Secrets should never be in code, environment variables, or container images
- Dynamic secrets (generated on-demand with TTL) are more secure than static credentials
- Automatic rotation reduces the blast radius of compromised credentials
- Audit logging tracks who accessed which secret and when — essential for compliance
Common Mistakes with Secrets Management
- Storing secrets in environment variables — they leak into logs, crash dumps, and child processes
- Committing secrets to git — even if deleted, they persist in git history forever
- Not rotating secrets after an employee leaves or a breach is suspected
- Using the same credentials across environments (dev/staging/prod)
- Not encrypting secrets at rest in the secret store — defense in depth applies here too
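The dynamic-secrets idea is easiest to see as a lease: a random credential minted on demand that expires on its own. A toy sketch of the concept only — not Vault's API; a real secret manager would also create the backing database user and revoke it when the lease expires:

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Lease:
    credential: str
    expires_at: float

    def valid(self, now: Optional[float] = None) -> bool:
        """A leased credential is only usable until its TTL elapses."""
        return (now if now is not None else time.monotonic()) < self.expires_at

def issue_credential(ttl_seconds: float) -> Lease:
    """Mint a fresh random credential with a lease. Because every caller
    gets a distinct short-lived credential, a leak has a bounded blast
    radius and there is nothing long-lived to rotate by hand."""
    token = secrets.token_urlsafe(32)
    return Lease(credential=token, expires_at=time.monotonic() + ttl_seconds)

lease = issue_credential(ttl_seconds=300)
```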
Tools for Secrets Management
- HashiCorp Vault (Open Source): Dynamic secrets, PKI, transit encryption — Scale: Medium-Enterprise
- AWS Secrets Manager (Managed): AWS-native, automatic RDS rotation — Scale: Small-Enterprise
- External Secrets Operator (Open Source): Sync cloud secrets into K8s Secrets — Scale: Medium-Enterprise
- CyberArk Conjur (Commercial): Enterprise PAM, compliance-focused — Scale: Enterprise
Related to Secrets Management
Zero Trust & Network Security, Compliance & Audit Logging, Container Runtime & Docker, GitOps & Infrastructure as Code
Serverless & FaaS — Compute & Orchestration
Difficulty: Intermediate
Key Points for Serverless & FaaS
- Execute code without managing servers — cloud provider handles provisioning, scaling, and patching
- Pay-per-invocation model — zero cost at zero traffic, but expensive at sustained high throughput
- Cold start latency (100ms-10s) is the primary trade-off — mitigated by provisioned concurrency
- Best for event-driven, bursty, short-duration workloads — not for long-running processes
- Vendor lock-in is significant — Lambda, Cloud Functions, and Azure Functions have different APIs and limits
Common Mistakes with Serverless & FaaS
- Using serverless for latency-sensitive synchronous APIs without provisioned concurrency
- Not accounting for concurrent execution limits — hitting the limit causes throttling
- Monolithic functions that do too much — each function should do one thing
- Ignoring cold start impact on P99 latency — 1% of requests may have 10x normal latency
- Not setting timeout and memory limits — runaway functions accumulate cost
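"Each function should do one thing" is concrete in code: a handler that validates one input and returns one result. The `handler(event, context)` signature is AWS Lambda's Python convention; the event shape below is a hypothetical API Gateway-style payload, since real event shapes depend on the trigger:

```python
import json

def handler(event, context=None):
    """Single-purpose Lambda-style handler: validate one record and
    return an HTTP-shaped response. No shared mutable state, so the
    platform can scale instances freely."""
    try:
        body = json.loads(event.get("body", "{}"))
        user_id = body["user_id"]
    except (json.JSONDecodeError, KeyError):
        return {"statusCode": 400,
                "body": json.dumps({"error": "user_id required"})}
    return {"statusCode": 200,
            "body": json.dumps({"user_id": user_id, "status": "queued"})}

ok = handler({"body": json.dumps({"user_id": "u-123"})})
bad = handler({"body": "not json"})
```

Keeping the handler this small also keeps the deployment package small, which directly reduces cold start time.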
Tools for Serverless & FaaS
- AWS Lambda (Managed): Broadest event source integration, mature ecosystem — Scale: Small-Enterprise
- Cloudflare Workers (Managed): Edge execution, V8 isolates, sub-ms cold start — Scale: Small-Enterprise
- Google Cloud Functions (Managed): GCP integration, Cloud Run for containers — Scale: Small-Enterprise
- Knative (Open Source): Serverless on Kubernetes, no vendor lock-in — Scale: Medium-Enterprise
Related to Serverless & FaaS
API Gateway, Auto-Scaling Patterns, CI/CD Pipeline Design, Metrics & Monitoring
Service Discovery & Registration — Compute & Orchestration
Difficulty: Intermediate
Key Points for Service Discovery & Registration
- Enables services to find and communicate with each other dynamically in ephemeral infrastructure
- Self-registration vs third-party registration — services register themselves or an external agent does it
- Client-side discovery gives callers control over load balancing; server-side is simpler for callers
- Health-aware routing removes unhealthy instances automatically from the discovery pool
- In Kubernetes, Services and Endpoints are the built-in discovery mechanism via kube-proxy and CoreDNS
Common Mistakes with Service Discovery & Registration
- Hardcoding service endpoints in configuration — breaks when instances change
- Not implementing graceful deregistration — clients route to terminated instances
- Ignoring DNS TTL caching — stale DNS records cause requests to dead instances
- Not distinguishing between liveness and readiness — a starting service shouldn't receive traffic
- Single registry instance — the registry itself becomes a single point of failure
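Several of the mistakes above (stale instances, missing deregistration) follow from how lease-based registries work: instances heartbeat with a TTL, and lookups return only unexpired leases. A toy in-memory sketch with explicit clocks (service names and addresses are made up):

```python
class Registry:
    """Toy service registry: register/heartbeat renews a lease, and
    lookup returns only instances whose lease is still live — this is
    how health-aware routing drops dead instances automatically."""

    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self._instances: dict[str, dict[str, float]] = {}

    def register(self, service: str, addr: str, now: float) -> None:
        self._instances.setdefault(service, {})[addr] = now

    heartbeat = register  # a heartbeat is just a lease renewal

    def deregister(self, service: str, addr: str) -> None:
        # Graceful shutdown path: remove immediately, don't wait for TTL
        self._instances.get(service, {}).pop(addr, None)

    def lookup(self, service: str, now: float) -> list[str]:
        live = self._instances.get(service, {})
        return [addr for addr, seen in live.items() if now - seen < self.ttl]

reg = Registry(ttl=30)
reg.register("orders", "10.0.0.1:8080", now=0)
reg.register("orders", "10.0.0.2:8080", now=0)
reg.heartbeat("orders", "10.0.0.1:8080", now=25)
# At t=40, only the instance that renewed at t=25 is still within its TTL
```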
Tools for Service Discovery & Registration
- Kubernetes Services (Open Source): Built-in K8s discovery, ClusterIP/NodePort/LoadBalancer — Scale: Medium-Enterprise
- Consul (Open Source): Multi-platform, health checks, KV config — Scale: Medium-Enterprise
- Eureka (Open Source): Spring Cloud ecosystem, AP semantics — Scale: Medium-Large
- Nacos (Open Source): Service discovery + config management, popular in Java — Scale: Medium-Large
Related to Service Discovery & Registration
DNS & Service Discovery, Kubernetes Architecture, Service Mesh, Load Balancer
Service Mesh — Networking & Traffic
Difficulty: Advanced
Key Points for Service Mesh
- Dedicated infrastructure layer for service-to-service communication via sidecar proxies
- Handles mTLS, retries, circuit breaking, and observability without application code changes
- Data plane (Envoy sidecars) handles traffic; control plane (Istiod) manages configuration
- Adds ~2-5ms latency per hop due to sidecar proxy — evaluate if the trade-off is worth it
- Most valuable at 50+ services — premature adoption adds complexity without proportional benefit
Common Mistakes with Service Mesh
- Adopting a service mesh with fewer than 20 services — the operational overhead exceeds the benefit
- Not accounting for the added memory and CPU per pod (Envoy sidecar uses 50-100MB RAM)
- Assuming the mesh handles all security — it provides transport encryption, not application-level authz
- Not tuning retry policies — default retries can amplify failures (retry storm)
- Ignoring the control plane as a critical dependency — if Istiod is down, new proxy configs can't be pushed
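The retry-storm mistake above has a standard mitigation: a retry budget, which caps retries to a fixed fraction of total requests (similar in spirit to what Envoy and Linkerd implement, though this is a simplified in-process sketch, not either proxy's actual mechanism):

```python
class RetryBudget:
    """Permit retries only while they stay under `ratio` of observed
    requests. With ratio=0.2, a total downstream outage amplifies load
    to at most 1.2x, instead of (1 + max_retries)x with naive retries."""

    def __init__(self, ratio: float = 0.2):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False

budget = RetryBudget(ratio=0.2)
for _ in range(100):
    budget.record_request()
# Worst case: every one of the 100 requests fails and asks to retry
allowed = sum(budget.can_retry() for _ in range(100))
```

A production version would decay the counters over a sliding window so the budget recovers after the incident.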
Tools for Service Mesh
- Istio (Open Source): Full-featured mesh, Envoy-based, large community — Scale: Large-Enterprise
- Linkerd (Open Source): Lightweight, Rust proxy, simpler operations — Scale: Medium-Large
- Consul Connect (Open Source): Multi-platform (K8s + VMs), integrated service discovery — Scale: Medium-Enterprise
- AWS App Mesh (Managed): AWS-native, ECS/EKS integration — Scale: Medium-Enterprise
Related to Service Mesh
API Gateway, Load Balancer, Kubernetes Architecture, Distributed Tracing, Zero Trust & Network Security
Zero Trust & Network Security — Security & Governance
Difficulty: Advanced
Key Points for Zero Trust & Network Security
- Never trust, always verify — every request is authenticated and authorized regardless of network location
- Micro-segmentation replaces network perimeter — each service-to-service call is policy-controlled
- mTLS provides both encryption and identity verification between services
- Network policies in Kubernetes restrict pod-to-pod communication at the CNI level
- Defense in depth — combine network policies, service mesh mTLS, and application-level authz
Common Mistakes with Zero Trust & Network Security
- Relying solely on network perimeter security — once inside, attackers move laterally
- Implementing mTLS without certificate rotation — expired certs cause outages
- Overly permissive network policies — 'allow all' defaults defeat the purpose
- Not auditing policy changes — security policies need the same review process as code
- Ignoring east-west traffic — most attacks exploit service-to-service communication, not north-south
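Micro-segmentation reduces to a deny-by-default policy table for east-west calls, in the spirit of Kubernetes NetworkPolicy or Istio AuthorizationPolicy. A toy sketch (the service names and policy entries are hypothetical):

```python
# Explicit allowlist of (caller, callee) pairs; everything else is denied.
ALLOWED_CALLS = {
    ("frontend", "orders"),
    ("orders", "payments"),
    ("orders", "inventory"),
}

def is_allowed(src: str, dst: str) -> bool:
    """Zero trust in one line: a service-to-service call is permitted
    only if explicitly listed. Lateral movement from a compromised
    service hits deny-by-default instead of an open network."""
    return (src, dst) in ALLOWED_CALLS
```

Note the pairs are directional: `orders` may call `payments`, but a compromised `payments` cannot call back into `orders`.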
Tools for Zero Trust & Network Security
- Istio (Open Source): mTLS, authorization policies, service identity (SPIFFE) — Scale: Large-Enterprise
- Calico (Open Source): K8s network policies, eBPF dataplane — Scale: Medium-Enterprise
- Open Policy Agent (Open Source): Policy-as-code, admission control, authz decisions — Scale: Medium-Enterprise
- Cilium (Open Source): eBPF networking, L7 visibility, network policies — Scale: Medium-Enterprise
Related to Zero Trust & Network Security
Service Mesh, Secrets Management, API Gateway, Compliance & Audit Logging