AI Infrastructure Cost Management — Platform Operations
Difficulty: Expert. Maturity Level: Optimizing.
Key Points for AI Infrastructure Cost Management
- GPU costs run 5-10x higher than equivalent CPU instances, making right-sizing the single highest-leverage cost optimization for AI workloads
- LLM API costs scale linearly with usage, and what looks like $500/month during a beta can easily become $50K/month at production scale if you are not tracking per-request costs from day one
- Implement chargeback for AI compute so product teams see their actual costs. Teams that see their GPU bill make very different architectural decisions than teams with a blank check.
- GPU scheduling and cluster-level orchestration improve utilization from the typical 15-30% range to 60-80%, often saving more than any single model optimization
- Smaller fine-tuned models deliver 90% of the quality at 10% of the cost for most production use cases. The largest model is rarely the right model.
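The per-request cost tracking point above can be sketched as follows. The model names and per-1K-token prices are hypothetical placeholders, not real vendor rates; plug in your provider's actual pricing.

```python
# Sketch: per-request LLM cost tracking with a monthly projection.
# The price table below is illustrative, NOT real vendor pricing.
PRICE_PER_1K_TOKENS = {
    # (input, output) in USD per 1,000 tokens -- placeholder values
    "large-model": (0.030, 0.060),
    "small-model": (0.0005, 0.0015),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single LLM API call, split by input/output tokens."""
    in_price, out_price = PRICE_PER_1K_TOKENS[model]
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price

def monthly_projection(cost_per_request: float, requests_per_day: int) -> float:
    """Project monthly spend from per-request cost and daily volume."""
    return cost_per_request * requests_per_day * 30
```

Logging this per request is what turns "$500/month in beta" into a forecastable number before production scale hits.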
Common Mistakes with AI Infrastructure Cost Management
- Treating GPU instances like CPU instances and leaving them running 24/7 for workloads that only need them during business hours or training runs
- No per-request cost tracking for LLM API calls. Without knowing what each call costs, you cannot optimize prompts, detect runaway loops, or forecast budgets.
- Defaulting to the largest available model for every use case. GPT-4 class models cost 30-60x more per token than GPT-3.5 class models, and most tasks do not need the extra capability.
- Ignoring inference costs during model design. A model that is 2% more accurate but requires 4x the compute for inference is usually a bad trade in production.
Related to AI Infrastructure Cost Management
Platform Team Operating Model, Self-Service Infrastructure
CI/CD Pipeline Standardization — Developer Platforms
Difficulty: Advanced. Maturity Level: Foundation.
Key Points for CI/CD Pipeline Standardization
- Shared pipeline templates reduce per-team maintenance from 2-4 hours per week to near zero while keeping builds consistent
- Build caching (Docker layer caching, dependency caching, test result caching) typically cuts CI time by 40-60%
- Security scanning must be in the pipeline, not a separate process. Shift-left means SAST and SCA run on every pull request
- Pipeline execution time is a developer productivity metric. Every minute added to CI costs engineering time across every commit
- Deployment approval workflows should be automated for staging and require explicit approval only for production
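A rough way to put a dollar figure on the CI-time point above. The inputs (commits per day, loaded hourly rate, workdays per month) are assumptions you would replace with your own numbers:

```python
# Sketch: monthly cost of adding wall-clock minutes to every CI run.
# Counts only developer wait time; queueing and context-switch costs are ignored.
def ci_minute_cost(extra_minutes: float, commits_per_day: int,
                   hourly_rate_usd: float, workdays_per_month: int = 21) -> float:
    """USD per month of engineering time lost to `extra_minutes` of added CI time."""
    wait_hours = (extra_minutes / 60) * commits_per_day * workdays_per_month
    return wait_hours * hourly_rate_usd
```

Five extra minutes across 60 commits a day at a $100/hour loaded rate is roughly $10,500 a month, which reframes "just one more pipeline stage" conversations.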
Common Mistakes with CI/CD Pipeline Standardization
- Mandating a single pipeline template with no escape hatch, which forces teams with unusual requirements to work around the system
- Running full integration test suites on every commit instead of using test impact analysis to run only affected tests
- Not measuring pipeline reliability separately from test reliability, making it impossible to identify flaky infrastructure vs flaky tests
Related to CI/CD Pipeline Standardization
Golden Paths & Paved Roads, Developer Portal & Backstage
Container Runtime Platform — Infrastructure Abstraction
Difficulty: Expert. Maturity Level: Foundation.
Key Points for Container Runtime Platform
- EKS and GKE handle control plane operations but you still own node management, networking, security policies, and cluster upgrades
- Spot instances for stateless workloads cut compute costs by 60-90% but require proper pod disruption budgets and graceful shutdown handling
- Namespace isolation is sufficient for trusted teams within a company. Cluster isolation is necessary for untrusted tenants or strict compliance requirements
- Cluster autoscaler and Karpenter take fundamentally different approaches. Karpenter provisions nodes based on pod requirements while cluster autoscaler scales existing node groups
- Container image management needs a registry with vulnerability scanning, image signing, and garbage collection to prevent storage costs from spiraling
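The spot-instance trade-off above can be modeled with a simple formula. The interruption-overhead term (extra compute spent redoing interrupted work) is a stated assumption, not measured data:

```python
# Sketch: effective spot savings after accounting for interrupted work.
def effective_spot_savings(on_demand_hourly: float, spot_discount: float,
                           interruption_overhead: float) -> float:
    """Fractional savings vs on-demand, where interruption_overhead is the
    fraction of extra compute needed to redo work lost to reclaims."""
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    effective_hourly = spot_hourly * (1 + interruption_overhead)
    return 1 - effective_hourly / on_demand_hourly
```

A 70% discount with 10% rework overhead still nets roughly 67% savings, which is why pod disruption budgets and graceful shutdown (which keep the overhead term small) are worth the effort.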
Common Mistakes with Container Runtime Platform
- Running a single large cluster for everything instead of separating production, staging, and platform infrastructure into dedicated clusters
- Not setting resource requests and limits on pods, which leads to noisy neighbor problems and nodes running out of memory
- Using mutable 'latest' tags for container images in production, making deployments non-reproducible and reliable rollbacks impossible
- Skipping pod disruption budgets so cluster upgrades and node drains take down services unexpectedly
Related to Container Runtime Platform
Internal Developer Platform Design, Service Mesh Implementation
Cost Management Platform — Platform Operations
Difficulty: Advanced. Maturity Level: Optimizing.
Key Points for Cost Management Platform
- Per-team cost attribution through Kubernetes labels and cloud resource tags makes spending visible and creates accountability without blame
- Kubecost provides real-time cost allocation at the pod and namespace level, showing exactly which team and service drives each dollar of spend
- Spot instances save 60-90% on compute but require workloads designed for interruption with proper pod disruption budgets and graceful shutdown
- Show-back models (reporting costs without charging) change behavior almost as effectively as charge-back (actual internal billing) with far less organizational friction
- Cost anomaly detection catches runaway spending within hours instead of discovering a $50k surprise on the monthly bill
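A minimal sketch of the anomaly-detection idea above, using a rolling mean and standard deviation over recent daily spend. Real platforms also model weekly seasonality and per-service baselines; this shows only the core mechanic:

```python
import statistics

def is_cost_anomaly(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's spend if it is more than `threshold` standard deviations
    above the mean of recent daily spend."""
    if len(history) < 7:
        return False  # not enough data to establish a baseline
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return today > mean  # flat history: any increase is suspicious
    return (today - mean) / stdev > threshold
```

Run this daily (or hourly) against per-team spend and the $50k surprise becomes a same-day alert instead of a month-end discovery.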
Common Mistakes with Cost Management Platform
- Treating cost optimization as a quarterly project instead of building automated guardrails that continuously right-size resources
- Buying reserved instances based on current usage without accounting for planned migrations, architecture changes, or workload growth
- Optimizing compute costs while ignoring data transfer, storage, and managed service costs, which often represent 40-60% of the total bill
Related to Cost Management Platform
AI Infrastructure Cost Management, Platform Team Operating Model
Self-Service Database Provisioning — Platform Operations
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for Self-Service Database Provisioning
- Self-service database provisioning should take under 5 minutes from request to a running instance with connection string delivered
- PgBouncer in transaction pooling mode can reduce PostgreSQL connection count from 2,000 application connections to 50 database connections
- Automated daily backups with point-in-time recovery are the minimum. Test restores monthly because untested backups are not backups
- Schema migration tooling like Atlas or Flyway should be integrated into CI/CD so database changes go through the same review process as code
- Multi-tenant shared instances save 60-80% on costs but require strict resource isolation through connection limits, statement timeouts, and schema separation
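The 2,000-to-50 connection claim above falls out of simple transaction-pooling arithmetic: in transaction mode a server connection is held only while a transaction is active. The active-fraction and headroom values below are illustrative assumptions, not PgBouncer defaults:

```python
import math

# Sketch: back-of-envelope sizing for transaction-mode connection pooling.
def required_server_connections(client_connections: int,
                                active_fraction: float,
                                headroom: float = 0.25) -> int:
    """Estimate server-side connections needed when a pooler multiplexes
    many mostly-idle clients onto a small set of database connections."""
    active = client_connections * active_fraction
    return math.ceil(active * (1 + headroom))
```

With 2,000 clients of which only 2% are mid-transaction at any instant, 50 server connections (including 25% headroom) suffice.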
Common Mistakes with Self-Service Database Provisioning
- Letting every team run their own database without centralized backup verification, monitoring, or security patching
- Using shared connection pools across all services without per-service connection limits, allowing one misbehaving service to exhaust the pool
- Skipping schema migration tooling and letting teams apply DDL changes manually in production through direct database access
- Not setting up automated alerting for disk space, replication lag, and long-running queries before they cause outages
Related to Self-Service Database Provisioning
Self-Service Infrastructure, Cost Management Platform
Developer Portal & Backstage — Developer Platforms
Difficulty: Intermediate. Maturity Level: Scaling.
Key Points for Developer Portal & Backstage
- A developer portal is the UI layer of your IDP. It provides a single pane of glass for service catalog, documentation, and self-service workflows.
- Backstage (Spotify, CNCF) is the dominant open-source option with a plugin architecture and strong community, but requires significant investment to operationalize
- Start with the software catalog (service ownership, API docs, dependencies) before adding self-service scaffolding or CI/CD visibility
- Portal adoption depends on making it the fastest way to get information. If developers still need to check Slack, Confluence, and PagerDuty separately, the portal fails.
- Plugin development is the long game. The portal becomes valuable when teams contribute domain-specific plugins (cost dashboards, security scans, compliance status).
Common Mistakes with Developer Portal & Backstage
- Deploying Backstage out of the box without customization and expecting developers to adopt it. Vanilla Backstage solves almost nothing.
- Building a portal without populating the software catalog first. An empty catalog teaches developers that the portal is useless.
- Treating the portal as an ops tool rather than a developer tool. The primary audience is application developers, not platform engineers.
- Underestimating the maintenance burden of Backstage. Upgrades, plugin compatibility, and catalog data freshness require ongoing investment.
Related to Developer Portal & Backstage
Internal Developer Platform Design, Golden Paths & Paved Roads
Developer Productivity Metrics — Productivity & Metrics
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for Developer Productivity Metrics
- DORA metrics (deployment frequency, lead time, change failure rate, MTTR) measure delivery performance, not individual productivity
- SPACE framework (Satisfaction, Performance, Activity, Communication, Efficiency) provides a multidimensional view that resists gaming
- Never use metrics to compare individual developers. They are system-level indicators, not performance reviews.
- Leading indicators (CI build time, PR review latency, environment provisioning time) are more actionable than lagging indicators (deployment frequency)
- Instrument automatically from your CI/CD and SCM systems. Manual reporting introduces bias and overhead.
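A sketch of computing DORA-style delivery metrics from deploy records. The record shape here is an assumption for illustration; in practice these events come straight from your CI/CD and SCM systems:

```python
import statistics

def dora_summary(deploys: list[dict], days: int = 28) -> dict:
    """Summarize delivery performance from deploy records shaped like
    {"failed": bool, "lead_hours": float}. System-level only: never
    attribute these numbers to individuals."""
    if not deploys:
        return {"deploys_per_day": 0.0, "change_failure_rate": 0.0,
                "median_lead_hours": 0.0}
    failures = sum(1 for d in deploys if d["failed"])
    return {
        "deploys_per_day": len(deploys) / days,
        "change_failure_rate": failures / len(deploys),
        "median_lead_hours": statistics.median(d["lead_hours"] for d in deploys),
    }
```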
Common Mistakes with Developer Productivity Metrics
- Measuring lines of code, commit count, or PRs merged as productivity proxies. These incentivize the wrong behaviors.
- Publishing individual developer metrics on dashboards. This destroys psychological safety and encourages gaming.
- Treating DORA metrics as targets rather than signals. Goodhart's Law applies aggressively to developer metrics.
- Measuring only speed without quality. High deployment frequency with a high change failure rate is not productivity, it is chaos.
Related to Developer Productivity Metrics
Platform Team Operating Model, Internal Developer Platform Design
Feature Flag Platforms — Developer Platforms
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for Feature Flag Platforms
- Local SDK evaluation (flags cached client-side) adds zero latency to feature checks while remote evaluation adds a network round trip per check
- LaunchDarkly handles 20+ trillion feature evaluations per day which proves the pattern scales, but self-hosted Unleash or Flagsmith can work at 90% lower cost
- Stale flags are technical debt with real cost. Teams that don't enforce flag cleanup end up with 500+ flags that nobody knows are safe to remove
- Feature flags enable progressive delivery: canary releases to 1% of users, ring-based rollouts, and instant kill switches without redeployment
- Multi-environment flag management requires a promotion workflow similar to code promotion through dev, staging, and production
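Local evaluation with percentage rollouts is typically a deterministic hash of flag plus user, so the same user always gets the same answer with no network call. This is a sketch of the pattern, not any vendor's SDK:

```python
import hashlib

def in_rollout(flag_key: str, user_id: str, percent: float) -> bool:
    """Deterministically bucket a user into a percentage rollout.
    Hashing flag+user keeps buckets independent across flags."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percent / 100
```

Ramping from 1% to 100% just changes the threshold; users already inside the rollout stay inside, which is what makes canary releases stable.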
Common Mistakes with Feature Flag Platforms
- Using feature flags as a permanent configuration system instead of temporary release toggles with defined expiration dates
- Evaluating flags with remote API calls in hot paths which adds 10-50ms latency per evaluation without local caching
- Skipping flag ownership assignment so nobody takes responsibility for cleaning up flags after features are fully launched
Related to Feature Flag Platforms
Golden Paths & Paved Roads, Developer Productivity Metrics
GitOps Platform Patterns — Developer Platforms
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for GitOps Platform Patterns
- Git becomes the single source of truth for both infrastructure and application configuration, making every change auditable through git history
- ArgoCD supports multi-cluster deployments natively with ApplicationSets while Flux uses a more decentralized model with one Flux instance per cluster
- Drift detection and automatic reconciliation mean that manual kubectl changes get reverted within minutes, enforcing declared state
- Secret management in GitOps requires tools like Sealed Secrets, SOPS, or External Secrets Operator since raw secrets cannot live in git
- PR-based deployments give you code review for infrastructure changes, automatic preview environments, and a clear rollback path through git revert
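The drift-detection and reconciliation loop above can be sketched in a few lines. Real controllers diff full Kubernetes objects and apply server-side patches, not flat dicts; this shows only the shape of the loop:

```python
# Sketch: one iteration of a GitOps sync loop over flat key/value state.
def detect_drift(desired: dict, live: dict) -> dict:
    """Return the fields where live cluster state differs from git-declared state."""
    drift = {}
    for key, want in desired.items():
        have = live.get(key)
        if have != want:
            drift[key] = {"desired": want, "live": have}
    return drift

def reconcile(desired: dict, live: dict) -> dict:
    """Revert manual changes by reapplying desired state over live state."""
    patched = dict(live)
    patched.update(desired)
    return patched
```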
Common Mistakes with GitOps Platform Patterns
- Storing application code and GitOps manifests in the same repository, which triggers deployments on every code commit even when manifests haven't changed
- Not implementing a promotion workflow so changes go directly from a developer's PR to production without passing through staging
- Relying on ArgoCD auto-sync for production without understanding that it will immediately apply any change merged to the target branch
Related to GitOps Platform Patterns
CI/CD Pipeline Standardization, Golden Paths & Paved Roads
Golden Paths & Paved Roads — Developer Platforms
Difficulty: Intermediate. Maturity Level: Foundation.
Key Points for Golden Paths & Paved Roads
- Golden paths make the right thing the easy thing. They are opinionated defaults, not enforced mandates.
- Start with service templates (cookiecutters/scaffolds) that include CI/CD, observability, and security out of the box
- Guardrails over gates: use policy-as-code to warn and nudge rather than hard-blocking developers
- Measure adoption rate as a signal, not a mandate. If fewer than 80% of new services use the golden path, improve the path rather than forcing compliance.
- Version your golden paths and support migration tooling so existing services can upgrade to new standards
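The guardrails-over-gates idea can be sketched as a policy evaluator that warns by default and hard-blocks only a short deny-list. The policy names and manifest fields below are illustrative, not a real OPA/Kyverno policy set:

```python
# Sketch: "warn and nudge" policy evaluation over a simplified service manifest.
def evaluate_policies(manifest: dict) -> dict:
    warnings, blocks = [], []
    if not manifest.get("resource_limits"):
        warnings.append("no resource limits set; platform defaults will be applied")
    if manifest.get("image_tag") == "latest":
        blocks.append("'latest' image tags are not reproducible")  # the rare hard gate
    if not manifest.get("owner"):
        warnings.append("no owning team declared")
    return {"allowed": not blocks, "warnings": warnings, "blocks": blocks}
```

Most findings become warnings the developer sees in CI output; only the handful of genuinely dangerous cases stop the pipeline.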
Common Mistakes with Golden Paths & Paved Roads
- Making golden paths mandatory without making them better than the alternative. Developers will route around bad abstractions.
- Creating a single golden path when your organization has fundamentally different workload types (APIs vs event processors vs ML pipelines)
- Neglecting Day 2 operations. A template that creates a service but does not help with debugging, scaling, or upgrading is only half the story.
- Not involving application developers in designing the paths. Platform engineers often optimize for infrastructure elegance rather than developer ergonomics.
Related to Golden Paths & Paved Roads
Internal Developer Platform Design, Self-Service Infrastructure
Internal API Gateway — Infrastructure Abstraction
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for Internal API Gateway
- Internal gateways handle east-west traffic between services while external gateways handle north-south traffic from clients
- Kong processes 100,000+ requests per second on a single node with sub-millisecond added latency for most plugin configurations
- Rate limiting at the gateway level protects downstream services from cascading failures and noisy neighbor problems
- A developer portal integrated with the gateway lets teams discover, test, and subscribe to internal APIs without filing tickets
- Schema validation at the gateway catches malformed requests before they reach your service, reducing error handling code across all backends
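Gateway-level rate limiting is usually a token bucket applied per consumer or per route. A minimal sketch of the mechanism (not Kong's actual implementation, which also coordinates counters across nodes):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: capacity bounds bursts,
    rate_per_sec bounds sustained throughput."""
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejecting at the gateway with 429s is what keeps one noisy caller from cascading failure into every downstream service.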
Common Mistakes with Internal API Gateway
- Using the same gateway for internal and external traffic without separate security policies, rate limits, and monitoring
- Adding too many plugins to the gateway which compounds latency from sub-millisecond to 10-20ms per request
- Not versioning internal APIs because they're internal, then breaking 15 services with a schema change
- Choosing a gateway based on feature count instead of operational simplicity for your team's skill level
Related to Internal API Gateway
Service Mesh Implementation, Self-Service Infrastructure
Internal Developer Platform Design — Developer Platforms
Difficulty: Advanced. Maturity Level: Foundation.
Key Points for Internal Developer Platform Design
- An IDP is a product, not a project. It requires product management, user research, and iterative delivery.
- Layer architecture: developer interface (portal/CLI), integration layer (APIs/webhooks), resource layer (infrastructure primitives)
- Build the thinnest possible platform that solves real friction. Resist building abstractions nobody asked for.
- Assemble from existing tools (Backstage, Argo, Crossplane) rather than building from scratch. Composition over creation.
- Measure success by developer adoption rate and time-to-production, not by number of features shipped
Common Mistakes with Internal Developer Platform Design
- Building an IDP as a top-down mandate without understanding actual developer pain points. Always start with user research.
- Over-abstracting infrastructure so developers lose visibility into what is actually running underneath
- Treating the platform as a one-time project rather than a continuously evolving product with a roadmap
- Forcing adoption through policy instead of making the platform genuinely easier than the alternative
Related to Internal Developer Platform Design
Golden Paths & Paved Roads, Developer Portal & Backstage
ML Platform & AI Golden Paths — Developer Platforms
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for ML Platform & AI Golden Paths
- ML golden paths cut experiment-to-production time from months to days by standardizing the boring parts: packaging, deployment, monitoring, and rollback
- The key design principle is abstracting infrastructure complexity while preserving experiment flexibility. Data scientists should never write Dockerfiles, but they should always control hyperparameters.
- Three capabilities are essential before anything else: a feature store for consistent feature computation, experiment tracking for reproducibility, and a model registry for versioning and lineage
- Self-service model deployment with built-in canary rollouts, A/B testing, and one-click rollback removes the biggest bottleneck in ML organizations
- Measure your ML platform by adoption rate and time-to-production. If data scientists are not using it, the platform is wrong, not the scientists.
Common Mistakes with ML Platform & AI Golden Paths
- Building an ML platform before you have proven ML use cases in production. You need at least 2-3 models running manually before you know what to automate.
- Over-abstracting the platform so data scientists cannot debug failures. When a training job fails, they need to see logs, not a generic 'pipeline error' message.
- Treating the ML platform as entirely separate from the application platform. Shared capabilities like CI/CD, secrets management, and observability should be reused, not rebuilt.
- Not involving data scientists in platform design. Engineers build what is elegant. Scientists need what is practical. These are often different things.
Related to ML Platform & AI Golden Paths
Golden Paths & Paved Roads, Internal Developer Platform Design
Observability Platform Design — Platform Operations
Difficulty: Expert. Maturity Level: Scaling.
Key Points for Observability Platform Design
- Two-tier instrumentation: eBPF (Grafana Beyla) provides baseline RED metrics and trace spans for every service with zero code changes, while OTel SDK adds depth for custom business metrics and profiling
- Value-based data routing in the OTel Collector pipeline drops health checks, samples debug noise, and enriches with ownership metadata before storage — reducing costs 15-30%
- Four pillars (metrics, traces, logs, profiles) with pillar-native storage: VictoriaMetrics, Tempo, VictoriaLogs, Pyroscope — each optimized for its signal's access pattern
- Profile-to-trace correlation is the key differentiator: click from a slow trace span to a CPU flame graph showing exactly which function is the bottleneck
- ML anomaly detection supplements SLO burn rates for alerting — ML goes to Slack for awareness, burn rates page via PagerDuty
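The burn-rate half of the alerting point above can be sketched directly. The 14.4 threshold is the conventional fast-burn value (consuming 2% of a 30-day error budget in one hour); tune it to your own windows and budgets:

```python
# Sketch: SLO error-budget burn rate with a multi-window paging decision.
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    budget = 1 - slo_target
    return error_rate / budget

def should_page(short_window_rate: float, long_window_rate: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window exceed the threshold,
    which filters out brief spikes that self-resolve."""
    return (burn_rate(short_window_rate, slo_target) >= threshold and
            burn_rate(long_window_rate, slo_target) >= threshold)
```

This is the PagerDuty path; the ML anomaly detector feeding Slack runs alongside it, not instead of it.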
Common Mistakes with Observability Platform Design
- Collecting everything at full resolution and getting a $200k/month Datadog bill before realizing that 90% of the data is never queried. Value-based routing fixes this at the pipeline level
- Treating observability as a project with an end date instead of an ongoing platform capability that evolves with the organization
- Stopping at three pillars and missing profiling — without profiles, the investigation ends at 'this service is slow' instead of 'this function is the bottleneck'
- Not correlating signals: metrics to traces to profiles to logs. Each click should deepen the investigation, not require switching tools
Related to Observability Platform Design
Developer Productivity Metrics, Platform Team Operating Model
Platform Team Operating Model — Platform Operations
Difficulty: Expert. Maturity Level: Optimizing.
Key Points for Platform Team Operating Model
- A platform team is a product team. It needs a product manager, user research, a roadmap, and regular stakeholder communication.
- Team Topologies defines the platform team as an enabling team that reduces cognitive load on stream-aligned (feature) teams
- Fund the platform team as a product investment, not as a cost center. Tie funding to developer productivity outcomes, not headcount.
- Define platform SLOs (provisioning latency, API availability, golden path adoption rate) and treat them with the same rigor as production service SLOs
- Run regular developer experience surveys and use the feedback to prioritize the roadmap. The platform team's customers are internal developers.
Common Mistakes with Platform Team Operating Model
- Staffing the platform team with only infrastructure engineers. You need product managers, developer advocates, and frontend engineers too.
- Building features nobody asked for because the platform team finds them technically interesting. Validate demand before building.
- Operating as a service team that takes requests rather than a product team that identifies and solves systemic problems
- Not having an explicit support model. Developers need to know how to get help, what the response SLAs are, and where to file feature requests.
Related to Platform Team Operating Model
Internal Developer Platform Design, Developer Productivity Metrics
Secrets Management Platform — Infrastructure Abstraction
Difficulty: Advanced. Maturity Level: Foundation.
Key Points for Secrets Management Platform
- Dynamic secrets that expire after use eliminate the risk of long-lived credentials sitting in config files for months
- HashiCorp Vault handles 10,000+ secret reads per second in production clusters but requires dedicated operational expertise to run reliably
- External Secrets Operator syncs secrets from Vault or cloud secret managers into Kubernetes, keeping your manifests free of secret references
- Secret sprawl detection tools like TruffleHog and GitGuardian catch credentials that leak into git repos, Slack messages, and CI logs
- Cloud-native secret managers (AWS Secrets Manager, GCP Secret Manager) are the right default unless you need multi-cloud or advanced dynamic secrets
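The dynamic-secrets idea reduces to a short-lived lease. A toy sketch of the pattern: a real backend such as Vault also creates and revokes the underlying database user, which is the part that makes expiry enforceable:

```python
import secrets
import time

class DynamicSecretLease:
    """Sketch: a credential minted on demand with a short TTL, so no
    long-lived secret ever sits in a config file."""
    def __init__(self, ttl_seconds: float):
        self.value = secrets.token_urlsafe(24)  # stand-in for a minted credential
        self.expires_at = time.monotonic() + ttl_seconds

    def is_valid(self) -> bool:
        return time.monotonic() < self.expires_at
```

Applications request a fresh lease when the old one nears expiry, which is also why secret rotation must be paired with application reload (see the rotation mistake below).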
Common Mistakes with Secrets Management Platform
- Storing secrets in native Kubernetes Secrets without encryption at rest; base64 encoding is not encryption, and anyone with cluster access can read them
- Running HashiCorp Vault without understanding unsealing, HA configuration, and backup procedures, leading to outages that lock out all applications
- Rotating secrets in the secret manager but not restarting or reloading the applications that cached the old secret value
Related to Secrets Management Platform
Self-Service Infrastructure, Internal Developer Platform Design
Self-Service Infrastructure — Infrastructure Abstraction
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for Self-Service Infrastructure
- Self-service means developers can provision, configure, and manage infrastructure without filing tickets or waiting for a platform team response
- Crossplane and Terraform modules are the two dominant approaches: Crossplane for Kubernetes-native declarative APIs, Terraform modules for broader cloud coverage
- Expose infrastructure as versioned APIs with sensible defaults. Developers request 'a PostgreSQL database for my staging environment,' not 'an RDS instance with specific VPC, subnet, and security group configurations.'
- Guardrails are essential. Use OPA/Kyverno policies to enforce cost limits, security baselines, and compliance requirements without manual approval gates.
- Track provisioning time as a key metric. If it takes more than 15 minutes to get a new environment, you are not self-service yet.
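The "versioned API with sensible defaults" point above can be sketched as intent expansion: the developer states what they need, the platform fills in hardened defaults. Field names and default values here are illustrative assumptions:

```python
# Sketch: expanding a minimal self-service request into a full resource spec.
DEFAULTS = {
    "engine": "postgres", "version": "16", "storage_gb": 20,
    "backups": True, "multi_az": False,
}

def provision_request(name: str, environment: str, **overrides) -> dict:
    """Turn developer intent ('a PostgreSQL database for staging') into a
    complete spec, with environment-aware defaults (production gets multi-AZ)."""
    spec = dict(DEFAULTS)
    if environment == "production":
        spec["multi_az"] = True
    spec.update(overrides)  # escape hatch for teams with unusual needs
    return {"name": name, "environment": environment, **spec}
```

The guardrails (OPA/Kyverno policies, cost limits) then validate the expanded spec, not the developer's two-field request.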
Common Mistakes with Self-Service Infrastructure
- Giving developers raw Terraform with no abstraction. They should not need to understand VPC peering to get a database.
- Building self-service without cost visibility. Teams will over-provision if they cannot see the dollar impact of their infrastructure choices.
- Skipping the cleanup problem. Self-service provisioning without automated deprovisioning creates cloud waste that compounds monthly.
- Making the self-service interface a web form that generates a ticket behind the scenes. This is self-service theater, not actual self-service.
Related to Self-Service Infrastructure
Internal Developer Platform Design, Platform Team Operating Model
Service Mesh Implementation — Infrastructure Abstraction
Difficulty: Expert. Maturity Level: Scaling.
Key Points for Service Mesh Implementation
- Service meshes add the most value above 20-30 microservices where manual mTLS and traffic management become unmanageable
- Sidecar-based meshes (Istio, Linkerd) add 1-3ms latency per hop while eBPF-based (Cilium) operates at kernel level with sub-millisecond overhead
- Start with observability features before enabling traffic management or mTLS to prove value quickly
- Linkerd has the smallest resource footprint and simplest operational model for teams without dedicated mesh operators
- Progressive rollout by namespace lets you validate mesh behavior on non-critical services before production workloads
Common Mistakes with Service Mesh Implementation
- Deploying a service mesh for 5 microservices when a simple HTTP retry library would solve the actual problem
- Enabling mTLS across the entire cluster on day one without testing certificate rotation under load
- Ignoring the control plane resource requirements, which can consume 2-4 GB of memory in Istio's default configuration
- Not accounting for mesh-unaware services that break when traffic gets redirected through sidecar proxies
Related to Service Mesh Implementation
Internal Developer Platform Design, Container Runtime Platform