Kubernetes Architecture
Why It Exists
Running a few containers on a single box is easy. Running thousands across hundreds of nodes is a completely different problem. Something needs to handle failures, scaling, networking, storage, and rolling deployments without someone babysitting it at 3am. Kubernetes grew out of Google's internal Borg system to solve exactly this: a declarative, self-healing platform where teams describe what they want running and the system figures out how.
With fewer than 10 services, Kubernetes might not be needed at all; Docker Compose or a simple process manager can be enough. But past that threshold, once multi-node scheduling, service discovery, automated rollbacks, and proper resource isolation are needed, K8s earns its complexity.
The standout aspect isn't any single feature. It's the extensibility model. CRDs, operators, admission webhooks, and scheduler plugins make it possible to build an entire platform on top of Kubernetes. That's why it won the orchestration wars and why the ecosystem around it keeps growing.
How It Works
Control Plane
The API Server (kube-apiserver) is the front door for every cluster operation. Every kubectl command, every controller reconciliation loop, every kubelet heartbeat goes through it as a REST call. It validates requests, runs them through admission controllers, persists state to etcd, and acts as the central hub. If the API server goes down, no changes can be made to the cluster. Period. Existing workloads keep running because kubelets operate independently, but nothing new gets scheduled and no changes get processed.
The API server also handles authentication and authorization for every request. Starting with Kubernetes 1.30+, structured authentication can be configured using CEL-based policies, enabling fine-grained access rules without writing webhook code. All requests pass through an authentication chain (client certs, bearer tokens, OIDC) followed by RBAC authorization before they touch any resource.
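A minimal sketch of such a configuration, assuming a hypothetical OIDC issuer. Field names follow the structured AuthenticationConfiguration API (v1beta1 as of 1.30); verify against the cluster's version before use:

```yaml
apiVersion: apiserver.config.k8s.io/v1beta1
kind: AuthenticationConfiguration
jwt:
- issuer:
    url: https://oidc.example.com          # hypothetical issuer URL
    audiences:
    - my-cluster
  claimMappings:
    username:
      expression: "'oidc:' + claims.sub"   # CEL: prefix usernames to avoid collisions
  claimValidationRules:
  - expression: "claims.hd == 'example.com'"
    message: "tokens must come from the example.com org"
```

The file is passed to the API server via its authentication configuration flag; RBAC authorization still applies after the identity is established.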
etcd is the cluster's source of truth. It's a distributed key-value store running the Raft consensus protocol. It needs a quorum of (n/2)+1 nodes, so production clusters run 3 or 5 etcd members. All cluster state lives here: pod specs, configmaps, secrets, RBAC rules, CRD instances, lease objects, everything. Lose etcd without backups and the cluster is gone. I've seen this happen. Back up etcd.
The Scheduler assigns unscheduled pods to nodes in two phases. First, filtering: which nodes meet hard constraints like resource availability, taints and tolerations, topology spread constraints, and affinity rules. Then scoring: ranking the eligible nodes by things like spreading across failure domains, image locality, and resource balance. The scheduler framework (stable since 1.19) supports plugins that hook into both phases. Most teams never need custom scheduler plugins, but they're there for workloads like ML training jobs that need topology-aware GPU placement.
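To make the two phases concrete, here's a sketch of a pod spec (names and image hypothetical) whose fields feed both filtering and scoring:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-7f9c                  # hypothetical
  labels:
    app: web
spec:
  containers:
  - name: app
    image: registry.example.com/web:1.4.2   # hypothetical image
    resources:
      requests:                   # filtering: node must have this much free
        cpu: 500m
        memory: 512Mi
  tolerations:                    # filtering: permits nodes tainted for this workload
  - key: dedicated
    operator: Equal
    value: web
    effect: NoSchedule
  topologySpreadConstraints:      # hard constraint with DoNotSchedule; scoring with ScheduleAnyway
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
```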
The Controller Manager runs dozens of reconciliation loops. The ReplicaSet controller checks that the right number of pods exist. The Deployment controller manages rollouts and rollbacks. The Job controller tracks completions. The StatefulSet controller handles ordered pod creation with stable identities. The CronJob controller triggers time-based workloads. Each controller watches API server events and takes corrective action to bring actual state in line with desired state. This reconciliation model is the core idea behind Kubernetes. Everything else, including the operator pattern, builds on top of it.
The Cloud Controller Manager (CCM) bridges Kubernetes with the cloud provider's APIs. It handles node lifecycle (detecting when a cloud VM gets terminated), route configuration for pod networking, and provisioning cloud load balancers for Services of type LoadBalancer. On managed platforms like EKS or GKE, the cloud provider runs this automatically. On bare metal, skip it entirely or use MetalLB for load balancer functionality.
Worker Nodes
The kubelet on each node is the agent that receives pod specs from the API server and makes sure containers are running. It talks to the container runtime through the Container Runtime Interface (CRI), a gRPC-based abstraction layer. containerd is the default runtime on nearly every distribution now. CRI-O is the alternative, popular in OpenShift environments. Anyone somehow still running Docker Engine as the runtime is on borrowed time: dockershim was removed in 1.24, and only the cri-dockerd shim keeps that setup alive.
The kubelet also handles volume mounting through the Container Storage Interface (CSI), runs liveness, readiness, and startup probes to check container health, reports node status and resource capacity back to the API server, and manages native sidecar containers (beta and on by default since 1.29, GA in 1.33). Native sidecars are init containers with restartPolicy: Always that start before and outlive the main containers, which solves the long-standing problem of sidecar lifecycle ordering.
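A minimal sketch of the pattern (container names and images are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar          # hypothetical
spec:
  initContainers:
  - name: log-shipper
    image: fluent/fluent-bit:3.0  # illustrative image tag
    restartPolicy: Always         # this one field makes it a native sidecar
  containers:
  - name: app
    image: registry.example.com/app:2.3   # hypothetical image
```

The sidecar starts before the app container, stays running alongside it, and is terminated after it during pod shutdown.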
kube-proxy implements the Services networking abstraction. It programs iptables or IPVS rules so traffic to a Service's ClusterIP gets load-balanced across the backing pods. In IPVS mode, better algorithms are available (least connections, weighted round robin) and it handles 10,000+ services without performance degradation. For more than a few hundred services, switch to IPVS mode. Some CNI plugins like Cilium can replace kube-proxy entirely by handling service routing in eBPF, which is even faster.
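Switching modes is a kube-proxy configuration change, typically edited in the kube-proxy ConfigMap. A minimal sketch:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
ipvs:
  scheduler: lc    # least connections; rr (round robin) is the default
```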
Networking Model
Kubernetes mandates a flat network: every pod gets a unique IP, and any pod can reach any other pod without NAT. CNI plugins implement this model. The major players:
- Calico uses BGP for routing and supports network policies for microsegmentation. Well-documented, widely deployed, and battle-tested at scale. This is the safe default for most clusters.
- Cilium uses eBPF for kernel-level packet processing, skipping iptables overhead entirely. It also provides advanced features like transparent encryption (WireGuard), DNS-aware network policies, and service mesh capabilities without sidecars. More powerful than Calico but adds operational complexity.
- Flannel is the simplest option. VXLAN overlay, minimal config. Fine for dev clusters and small production setups, but it doesn't support network policies natively.
Pick Calico unless there's a specific reason to choose something else. For eBPF performance, L7 visibility, or replacing kube-proxy entirely, go with Cilium. Flannel works for teams that just need something running and plan to add network policies later (they won't, but I'm not here to judge).
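Whichever CNI is chosen, network policies usually start from a default-deny baseline (Calico and Cilium enforce these; plain Flannel does not). A minimal example, assuming a namespace named prod:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod          # hypothetical namespace
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```

With this in place, each allowed flow gets its own explicit allow policy.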
Storage Architecture (CSI)
The Container Storage Interface decouples storage provisioning from Kubernetes core. CSI drivers run as pods in the cluster and handle volume creation, attachment, snapshots, and expansion for specific storage backends. EBS CSI driver for AWS, PD CSI driver for GCP, Ceph CSI for on-prem.
StorageClasses define storage tiers. Examples: fast-ssd (gp3 EBS with 3000 IOPS), standard (gp2), and archival (sc1). Set a default StorageClass so developers don't have to think about it for routine workloads. PersistentVolumeClaims (PVCs) request storage of a given class and size. VolumeSnapshots enable point-in-time backups.
For stateful workloads (databases, message queues), pay attention to the reclaimPolicy. Delete is the default and it nukes the underlying volume when the PVC is deleted. Set it to Retain for anything worth keeping.
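A sketch of the fast-ssd tier from above using the AWS EBS CSI driver, with Retain so deleting a claim keeps the underlying volume (parameter names match the EBS CSI driver; adjust for other backends):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain               # volume survives PVC deletion
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  iops: "3000"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                     # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```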
Gateway API and Ingress
Gateway API is the successor to the Ingress resource. Its core resources reached GA with the project's v1.0 release in late 2023, and on current Kubernetes (1.32+) it's the recommended way to handle traffic routing into the cluster. The old Ingress resource still works but doesn't get new features.
Why Gateway API won: Ingress tried to be one resource for everything and ended up being one resource that was good at nothing. Every controller added its own annotations for TLS configuration, path matching, header routing, and traffic splitting. The result was a mess of vendor-specific annotations that weren't portable.
Gateway API fixes this with typed, role-oriented resources:
- GatewayClass: Defines which controller handles the gateway (like a StorageClass for networking).
- Gateway: The actual infrastructure (load balancer, IP address, TLS listeners). Platform teams own this.
- HTTPRoute / GRPCRoute / TCPRoute / TLSRoute: Application developers attach routes to gateways. Each route type has purpose-built fields instead of shoehorning everything into annotations.
This separation matters. Platform teams control the gateway infrastructure and security policies. Application teams attach their routes without needing cluster-admin access. Header-based routing, traffic splitting for canaries, request mirroring, and cross-namespace references all work out of the box. No annotations needed.
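A sketch of what this looks like in practice: a 90/10 canary split attached to a platform-owned Gateway. The gateway, namespaces, and service names are all hypothetical.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-canary
  namespace: app-team              # the app team owns this route
spec:
  parentRefs:
  - name: shared-gateway           # the platform team owns the Gateway
    namespace: infra               # listener must allow routes from this namespace
  hostnames: ["app.example.com"]
  rules:
  - backendRefs:
    - name: web-v1
      port: 8080
      weight: 90                   # 90% of traffic to the stable version
    - name: web-v2
      port: 8080
      weight: 10                   # 10% to the canary
```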
Controllers: Envoy Gateway, Contour, Traefik, and the cloud-native implementations (AWS Gateway API Controller, GKE Gateway Controller) all support it. For any new cluster in 2026, use Gateway API from day one.
The Operator Pattern
This is where Kubernetes goes from a container scheduler to a full application platform. And it's the piece most introductions skip or gloss over.
An operator is a custom controller paired with one or more Custom Resource Definitions (CRDs). The controller watches for changes to those custom resources and takes action to reconcile actual state with desired state. Same reconciliation loop that powers built-in controllers like Deployments and ReplicaSets, but for domain-specific logic.
Why operators matter
Think about running PostgreSQL on Kubernetes without an operator. Someone needs to handle primary/replica topology, failover, backups, connection pooling, certificate rotation, and version upgrades. Someone has to write scripts for all of that and wire them up to Kubernetes lifecycle hooks. It's fragile, hard to test, and every team reinvents the wheel.
With a PostgreSQL operator (CloudNativePG, Zalando's postgres-operator, CrunchyData's PGO), a YAML manifest says "I want a 3-node PostgreSQL 16 cluster with daily backups to S3." The operator handles the rest. It creates the StatefulSet, configures replication, sets up backup schedules, manages failover, and handles minor version upgrades. When a replica falls behind, the operator detects it and rebuilds. When the version is bumped in the manifest, the operator orchestrates a rolling upgrade with proper drain and fence procedures.
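A sketch of roughly what that manifest looks like with CloudNativePG (exact fields vary by operator version; treat this as illustrative, not copy-paste, and the bucket path is hypothetical):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-main
spec:
  instances: 3                      # one primary, two replicas
  imageName: ghcr.io/cloudnative-pg/postgresql:16
  storage:
    size: 50Gi
  backup:
    barmanObjectStore:              # base backups + WAL archiving to object storage
      destinationPath: s3://backups/pg-main   # hypothetical bucket
```

In CloudNativePG the daily schedule itself lives in a separate ScheduledBackup resource; other operators fold it into the cluster spec.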
Real-world operators worth knowing
- cert-manager: Automates TLS certificate lifecycle. Integrates with Let's Encrypt (ACME), Vault, and AWS Private CA. Most Kubernetes production environments use this.
- Prometheus Operator: Defines monitoring targets as CRDs (ServiceMonitor, PodMonitor). Makes Prometheus configuration declarative instead of editing config files.
- Karpenter: Node provisioning operator for AWS (and increasingly other clouds). Watches for unschedulable pods and spins up right-sized nodes in seconds. Replaces the Cluster Autoscaler for most EKS setups and does a better job because it can provision heterogeneous instance types.
- ArgoCD / Flux: GitOps operators that sync cluster state from a git repository. Changes go through pull requests, get reviewed, and the operator applies them.
- Crossplane: Provisions cloud infrastructure (RDS, S3, VPCs) through Kubernetes CRDs. The platform team defines compositions, and developers request resources without touching cloud consoles.
Building a custom operator
Use kubebuilder or the Operator SDK (which wraps kubebuilder). They scaffold the project structure, generate CRD manifests from Go types, and provide a controller runtime with built-in leader election, caching, and event handling.
Rules of thumb for custom operators:
- Rate limit the reconcile loop. An operator that hammers the API server on every event will destabilize the cluster. Use exponential backoff and requeue intervals.
- Make reconciliation idempotent. The controller might run the same reconcile function multiple times for the same event. Every operation should be safe to repeat.
- Use finalizers for cleanup. If the operator creates external resources (cloud infra, DNS records), add a finalizer so those resources get cleaned up when the CR is deleted.
- Test against a real cluster. Unit tests with fake clients don't catch the issues that show up under real API server latency and concurrent access. Use envtest or kind for integration tests.
- Load test before production. Create 1,000+ instances of the CRD and watch what happens. A controller that can't handle it will fail during a scaling event at the worst possible time.
Node Fingerprinting
The scheduler can only place workloads intelligently if it knows what each node is capable of. That's where fingerprinting comes in.
What it is
Node fingerprinting is the process of discovering and reporting hardware and software capabilities of each node so the scheduler and operators can make informed placement decisions. Without it, the scheduler treats every node as identical. That's fine when nodes are homogeneous commodity VMs. It falls apart the moment GPUs, specialized CPUs, NVMe storage, or different processor architectures enter the picture.
How kubelet does it
The kubelet automatically discovers and reports basic node characteristics as labels and capacity values:
- Architecture: `kubernetes.io/arch` (amd64, arm64)
- Operating system: `kubernetes.io/os` (linux, windows)
- Instance type: `node.kubernetes.io/instance-type` (m5.xlarge, etc.)
- Topology: `topology.kubernetes.io/zone`, `topology.kubernetes.io/region`
- Allocatable resources: CPU, memory, ephemeral storage, and hugepages
This covers the basics but misses hardware-specific features.
Node Feature Discovery (NFD)
NFD is a Kubernetes SIG project that runs as a DaemonSet and performs deep hardware fingerprinting. It detects:
- CPU features: AVX-512, AES-NI, SGX, AMX (matters for ML inference and cryptographic workloads)
- GPU presence and model: NVIDIA A100 vs H100, AMD MI300X
- Storage type: NVMe vs SATA, local SSD presence
- Network capabilities: SR-IOV support, RDMA, DPDK compatibility
- Kernel features: Loaded modules, kernel version, security features
- Memory: NUMA topology, persistent memory (PMEM)
NFD publishes these as node labels (feature.node.kubernetes.io/cpu-cpuid.AVX512F: true). Then use nodeSelector or nodeAffinity in pod specs to target nodes with specific capabilities.
Why this matters in practice
A concrete example: an ML inference service needs GPU acceleration and AVX-512 for CPU fallback. Without fingerprinting, there are two options. Manual labeling (error-prone and doesn't scale) or scheduling blindly and hoping for the best (pods land on nodes without GPUs and fail).
With NFD plus the NVIDIA device plugin, the scheduler sees exactly which nodes have H100 GPUs with 80GB VRAM, which have AVX-512, and which have fast local NVMe for model caching. The pod spec says "I need 1 GPU and prefer a node with NVMe storage" and the scheduler handles the rest.
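A sketch of that pod spec, assuming the NVIDIA device plugin is installed and NFD is publishing labels (the label keys follow NFD conventions; the image name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  containers:
  - name: model-server
    image: registry.example.com/inference:2.1   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1                       # resource exposed by the NVIDIA device plugin
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard: node must have AVX-512
        nodeSelectorTerms:
        - matchExpressions:
          - key: feature.node.kubernetes.io/cpu-cpuid.AVX512F
            operator: In
            values: ["true"]
      preferredDuringSchedulingIgnoredDuringExecution:  # soft: prefer local SSD/NVMe
      - weight: 50
        preference:
          matchExpressions:
          - key: feature.node.kubernetes.io/storage-nonrotationaldisk
            operator: In
            values: ["true"]
```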
For heterogeneous clusters (mixed CPU architectures, GPU and non-GPU nodes, different storage tiers), fingerprinting isn't optional. It's the foundation that makes intelligent scheduling possible.
Production Considerations
- Resource management. Always set `requests` (the allocation the scheduler reserves) and `limits` (a hard ceiling enforced by cgroups). Use `LimitRange` and `ResourceQuota` objects to enforce guardrails at the namespace level. Teams that skip this regret it within weeks when one workload starves everything else.
- Pod Security Admission. PodSecurityPolicy is gone (removed in 1.25). Pod Security Admission (PSA) is the replacement. Enforce the `restricted` profile in production namespaces. It blocks privileged containers, host networking, hostPath mounts, and running as root. Apply it with namespace labels: `pod-security.kubernetes.io/enforce: restricted`.
- Admission policies with CEL. ValidatingAdmissionPolicy (GA in 1.30) enables writing policy rules as CEL expressions directly, without deploying webhook servers. Use this for things like "all containers must use image digests" or "no Service of type LoadBalancer in dev namespaces" (see the sketch after this list). Simpler to operate than OPA Gatekeeper or Kyverno for straightforward policies.
- etcd backup. Snapshot etcd regularly with `etcdctl snapshot save`. Store snapshots off-cluster in object storage. Test the restoration procedure before it's needed, not during an outage at 2am.
- Control plane HA. Run 3+ API server replicas behind a load balancer, 3+ etcd members across availability zones, and leader-elected scheduler/controller-manager pairs. A control plane that is a single point of failure is not production-grade Kubernetes.
- Pod Disruption Budgets. Define `minAvailable` or `maxUnavailable` so that voluntary disruptions (node drains, cluster upgrades) can't take down all replicas at once. This is trivial to set up and prevents entirely avoidable outages.
- Security hardening. Enable RBAC with least-privilege roles. Use network policies to deny all traffic by default and allow only what's needed. Rotate service account tokens (bound tokens expire automatically since 1.22). Scan images with Trivy or Grype in CI. Sign images with cosign and verify signatures with a ValidatingAdmissionPolicy or Kyverno. Most clusters I've audited are way too permissive out of the box.
- In-place pod resource resizing. In-place resize of CPU and memory went beta (enabled by default) in Kubernetes 1.33, allowing a pod's resources to change without restarting it. This is useful for Java workloads where increasing heap during a traffic spike is needed without losing existing connections. Still relatively new, so test thoroughly before relying on it for stateful workloads.
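Picking up the admission-policy bullet above, a minimal sketch of the "images must use digests" rule as a ValidatingAdmissionPolicy with its binding. Names are hypothetical, and the CEL expression covers Deployments only; a real policy would also match other workload kinds:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-image-digests        # hypothetical name
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: >-
      object.spec.template.spec.containers.all(c, c.image.contains('@sha256:'))
    message: "All container images must be pinned to a digest."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-image-digests-binding
spec:
  policyName: require-image-digests
  validationActions: ["Deny"]        # reject non-compliant objects outright
```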
Failure Scenarios
Scenario 1: etcd Quorum Loss. In a 3-node etcd cluster, two members crash at the same time (correlated AZ failure, for example). The API server goes read-only. No new pods get scheduled, no deployments roll out, HPA can't scale, and operators can't reconcile. Existing workloads keep running because kubelet operates independently, but any pod that crashes won't get rescheduled. Detection: watch etcd_server_has_leader (drops to 0) and etcd_server_proposals_failed_total (spikes). Recovery: restore from snapshot to a new quorum. Running 5-member etcd clusters across 3 AZs enables surviving any single-AZ outage. Target: keep etcd write latency P99 under 25ms. Above that, expect cluster instability. Use dedicated SSD-backed disks for etcd data. Sharing disks with other workloads is asking for trouble.
Scenario 2: Operator-Induced Control Plane Overload. A misconfigured custom operator enters a tight reconcile loop and starts making thousands of API server calls per second. Maybe a status update triggers a watch event, which triggers another reconcile, which updates status again. The API server's request queue saturates, etcd write throughput gets exhausted, and all controllers (including core ones like Deployment and ReplicaSet) stall. New pods stop getting scheduled. HPA stops working. The cluster effectively freezes even though all nodes are healthy. Detection: apiserver_request_total rate spikes, apiserver_request_duration_seconds P99 exceeds 1s, etcd_disk_wal_fsync_duration_seconds climbs. Check apiserver_request_total grouped by user-agent to identify the offending controller. Recovery: scale the operator deployment to 0 replicas immediately. Apply API Priority and Fairness (APF) flow schemas to throttle the offending service account. Prevention: enforce rate limiting in operator code with client-go rate limiters. Set MaxConcurrentReconciles appropriately. Use APF catch-all rules to protect the system from any single controller consuming more than 30% of API server capacity. Load test every operator against a staging cluster before deploying to production.
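As a sketch of the APF throttling step, assuming the offending operator runs as a service account named bad-operator in an operators namespace (both hypothetical), a FlowSchema can pin its traffic to a low-share priority level:

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: throttle-bad-operator        # hypothetical
spec:
  priorityLevelConfiguration:
    name: catch-all                  # built-in low-share priority level
  matchingPrecedence: 500            # lower values are evaluated first
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: bad-operator
        namespace: operators
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]
```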
Scenario 3: CNI Plugin Crash on Rolling Update. A Cilium DaemonSet upgrade causes the agent to crash-loop on 30% of nodes because of a kernel version incompatibility. Pods on affected nodes lose all network connectivity. New pods schedule to those nodes but can't reach anything. The failure is partial, which makes it harder to detect than a full outage. Some requests work, some time out, and load balancers keep sending traffic to broken nodes because kubelet still reports them as ready (kubelet doesn't know the network layer is down). Detection: kube_pod_status_phase{phase="Running"} drops, application 5xx rates spike inconsistently, cilium_unreachable_nodes rises. Recovery: roll back the DaemonSet update (kubectl rollout undo), cordon affected nodes, drain workloads. Prevention: always use maxUnavailable: 1 on DaemonSet updates and canary to a single node first. Test CNI upgrades against every kernel version in the fleet. I've been burned by this exact scenario. Never again.
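A trimmed sketch of the update-strategy piece (image tag and labels are illustrative, not the full Cilium manifest):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cilium
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: cilium
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1            # upgrade one node's agent at a time
  template:
    metadata:
      labels:
        k8s-app: cilium
    spec:
      containers:
      - name: cilium-agent
        image: quay.io/cilium/cilium:v1.16.1   # pin the exact version being canaried
```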
Capacity Planning
| Component | Recommended Threshold | Real-World Reference |
|---|---|---|
| etcd cluster size | 3 (small), 5 (large); never even numbers | Google Borg: sharded etcd equivalent for 10K+ node clusters |
| etcd storage | < 8GB DB size; compact and defrag regularly | Shopify: 6GB etcd at 12K pods, automated compaction every 5 min |
| API server QPS | < 500 sustained per instance; add replicas beyond | Airbnb: 4 API server replicas for ~5K node clusters |
| Pods per node | 110 default (kubelet), 250 max with tuning | Pinterest: 200 pods/node with optimized IP allocation via Calico |
| Nodes per cluster | < 5,000 (upstream tested); practically < 2,000 | PayPal: 4,000-node clusters; Spotify prefers many smaller clusters (~200 nodes) |
| Pod scheduling rate | ~50 pods/sec default scheduler; higher with scheduler plugins | Scheduling throughput scales linearly with API server replicas |
| CRD instances per type | < 50,000 without custom indexing; watch costs grow linearly | Large operator deployments frequently hit this ceiling around 20K |
Key formulas: etcd IOPS ≈ 2 × cluster events per second (each write is journaled, then compacted). API server memory ≈ 1GB base + 50MB per 1,000 pods. Operator memory = base + (CRD instance count × per-instance cache size). For operators caching thousands of resources, memory use can grow faster than expected.
Past 2,000 nodes, strongly consider multi-cluster with Cluster API or vcluster for tenant isolation instead of pushing a single control plane further. Scaling one cluster to 5K nodes is technically possible but operationally painful. Spotify, Shopify, and several other large-scale operators have publicly shared that they prefer federations of smaller clusters over single massive ones. The blast radius of a control plane issue in a 5K-node cluster is far worse than in a 200-node cluster.
For Karpenter users: Karpenter provisions nodes in 30-60 seconds compared to 2-5 minutes for the Cluster Autoscaler. It also consolidates underutilized nodes by moving workloads and terminating wasted capacity. At scale, this saves 20-35% on compute costs compared to static node groups.
Architecture Decision Record
ADR: Kubernetes Deployment Strategy, Managed vs Self-Hosted
Context: Choosing between managed Kubernetes (EKS, GKE, AKS), lightweight distributions (k3s, Talos), or fully self-hosted.
| Criteria (Weight) | EKS/GKE/AKS (Managed) | k3s (Lightweight) | Talos Linux (Immutable) | Self-Hosted (kubeadm) |
|---|---|---|---|---|
| Operational overhead (25%) | Low, control plane managed | Medium, single binary but team owns upgrades | Low-Medium, API-driven node mgmt, no SSH | High, team owns everything |
| Cost at < 50 nodes (20%) | $73/mo (EKS) + node cost; GKE free tier for 1 zonal cluster | Nodes only | Nodes only | Nodes only |
| Cost at 500+ nodes (20%) | $73/mo (EKS), negligible at scale | Nodes only, but ops cost rises | Nodes only, minimal ops | Significant ops team cost |
| Upgrade reliability (15%) | Managed blue-green control plane | Manual but fast (single binary swap) | Automated rolling OS + K8s upgrades | Manual, risky, requires etcd backup |
| Customization depth (10%) | Limited (no custom scheduler, no etcd access on most) | Full (standard K8s, just lightweight) | Full K8s, immutable OS limits host customization | Full control over every component |
| Security posture (10%) | Good defaults, shared responsibility | Standard, team handles hardening | Excellent (immutable, no shell, API-only, minimal attack surface) | Full responsibility |
Decision guidance:
For teams under 10 engineers running cloud-native workloads, just use managed Kubernetes. The $73/month fee pays for itself in reduced on-call burden within the first week. GKE Autopilot is worth evaluating because Google manages both the control plane and the node pool, which means even less to think about.
For edge, IoT, or resource-constrained environments, evaluate k3s. It runs on a Raspberry Pi and packs the full Kubernetes API into a single binary under 100MB.
For teams that care deeply about security and want an immutable infrastructure approach, Talos Linux is compelling. Every node is managed through an API (no SSH, no shell access). The OS is read-only and purpose-built to run Kubernetes. Upgrades are atomic. The attack surface is minimal. It's gaining serious traction in regulated environments (finance, healthcare) where auditability and immutability matter.
Self-hosted with kubeadm only makes sense when regulation demands full control plane ownership (financial services, defense, air-gapped networks) and dedicated platform engineers are available. Plan for a minimum of 2 FTEs to cover 24/7 operations. Anything less is setting the team up for a rough time.
For organizations running 10+ clusters, invest in Cluster API (CAPI) for lifecycle management. CAPI treats clusters as declarative resources, enabling creation, upgrading, and deletion of entire clusters through the same GitOps workflow used for application deployments.
Key Points
- Container orchestration platform born from Google's Borg. Automates deployment, scaling, self-healing, and rollback for containerized workloads.
- The control plane (API server, etcd, scheduler, controller manager, cloud controller manager) tracks desired state. Worker nodes run actual workloads through kubelet and a CRI-compatible container runtime.
- Operators extend Kubernetes by encoding operational knowledge into custom controllers paired with CRDs. They turn complex stateful apps into first-class platform citizens.
- Node fingerprinting (via kubelet and NFD) discovers hardware capabilities like GPUs, CPU instruction sets, and NVMe disks so the scheduler places workloads on the right machines.
- Gateway API has replaced Ingress as the standard for traffic routing. It provides typed routes (HTTP, gRPC, TCP, TLS) with role-oriented resource ownership.
- Resource requests and limits, HPA, VPA, and Karpenter enable right-sizing workloads and auto-scaling nodes in response to real demand.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| EKS | Managed | AWS-native, Fargate serverless pods, deep IAM integration | Medium-Enterprise |
| GKE | Managed | Autopilot mode, fastest upstream K8s releases, best managed experience | Medium-Enterprise |
| AKS | Managed | Azure-native, KEDA built-in, good Windows container support | Medium-Enterprise |
| k3s | Open Source | Lightweight single binary, edge and IoT, ARM support | Small-Medium |
| Talos Linux | Open Source | Immutable OS purpose-built for K8s, API-managed nodes, no SSH | Medium-Enterprise |
| OpenShift | Commercial | Enterprise compliance, integrated CI/CD, developer portal | Large-Enterprise |
Common Mistakes
- Skipping resource requests and limits. Pods will get OOMKilled or starve neighbors. Every production pod needs both.
- Using the `latest` tag in production. Deterministic deploys are lost and rollbacks become a guessing game. Pin to digests or immutable tags.
- Running workloads without PodDisruptionBudgets. Cluster upgrades and node drains will nuke all replicas at once.
- Ignoring namespace resource quotas. One team's runaway deployment eats the whole cluster budget.
- Not configuring liveness and readiness probes. Without them, Kubernetes routes traffic to broken pods and never restarts stuck containers.
- Writing operators without rate limiting on the reconcile loop. A tight loop hammering the API server can destabilize the entire control plane.
- Treating Ingress as the long-term routing solution. Gateway API is the standard now. New clusters should start with it.
- Skipping node fingerprinting for GPU or specialized workloads. Without proper labels, the scheduler has no way to match workloads to hardware capabilities.