AI Infrastructure Cost Management — Platform Operations
Difficulty: Expert. Maturity Level: Optimizing.
Key Points for AI Infrastructure Cost Management
- GPU costs run 5-10x higher than equivalent CPU instances, making right-sizing the single highest-leverage cost optimization for AI workloads
- LLM API costs scale linearly with usage, and what looks like $500/month during a beta can easily become $50K/month at production scale if you are not tracking per-request costs from day one
- Implement chargeback for AI compute so product teams see their actual costs. Teams that see their GPU bill make very different architectural decisions than teams with a blank check.
- GPU scheduling and cluster-level orchestration improve utilization from the typical 15-30% range to 60-80%, often saving more than any single model optimization
- Smaller fine-tuned models deliver 90% of the quality at 10% of the cost for most production use cases. The largest model is rarely the right model.
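The per-request cost tracking point above can be sketched as a small cost ledger tagged by team for chargeback. The model names and per-1K-token prices below are placeholders for illustration, not real provider pricing:

```python
from dataclasses import dataclass

# Illustrative per-1K-token prices (USD). Real prices vary by provider
# and change frequently -- treat these numbers as placeholders.
PRICE_PER_1K = {
    "large-model": {"input": 0.030, "output": 0.060},
    "small-model": {"input": 0.0005, "output": 0.0015},
}

@dataclass
class RequestCost:
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def usd(self) -> float:
        p = PRICE_PER_1K[self.model]
        return (self.input_tokens * p["input"]
                + self.output_tokens * p["output"]) / 1000

def record(cost: RequestCost, team: str, ledger: dict) -> None:
    # Tag every call with a team so costs roll up for chargeback.
    ledger[team] = ledger.get(team, 0.0) + cost.usd

ledger = {}
record(RequestCost("large-model", 1200, 400), "search", ledger)
record(RequestCost("small-model", 1200, 400), "search", ledger)
print(f"{ledger['search']:.4f}")
```

Note the 50x spread between the two hypothetical models for the identical request -- exactly the gap that per-request tracking makes visible.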
Common Mistakes with AI Infrastructure Cost Management
- Treating GPU instances like CPU instances and leaving them running 24/7 for workloads that only need them during business hours or training runs
- No per-request cost tracking for LLM API calls. Without knowing what each call costs, you cannot optimize prompts, detect runaway loops, or forecast budgets.
- Defaulting to the largest available model for every use case. GPT-4 class models cost 30-60x more per token than GPT-3.5 class models, and most tasks do not need the extra capability.
- Ignoring inference costs during model design. A model that is 2% more accurate but requires 4x the compute for inference is usually a bad trade in production.
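The 24/7 GPU mistake above can be avoided with a simple schedule-driven scaling decision. This sketch assumes an interactive-only pool and an illustrative 08:00-18:00 weekday window:

```python
from datetime import datetime, time

def desired_gpu_replicas(now: datetime, max_replicas: int) -> int:
    # Scale an interactive-use GPU pool to zero outside business hours.
    # The 08:00-18:00 Mon-Fri window is an assumed policy, not a default.
    weekday = now.weekday() < 5
    in_hours = time(8, 0) <= now.time() < time(18, 0)
    return max_replicas if (weekday and in_hours) else 0

print(desired_gpu_replicas(datetime(2024, 1, 10, 12, 0), 4))
```

In practice the decision function would feed an autoscaler or a scheduled job that resizes the node pool; batch training jobs need a different policy (run to completion, then release).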
Related to AI Infrastructure Cost Management
Platform Team Operating Model, Self-Service Infrastructure
Developer Portal & Backstage — Developer Platforms
Difficulty: Intermediate. Maturity Level: Scaling.
Key Points for Developer Portal & Backstage
- A developer portal is the UI layer of your IDP — it provides a single pane of glass for service catalog, documentation, and self-service workflows
- Backstage (Spotify, CNCF) is the dominant open-source option with a plugin architecture and strong community, but requires significant investment to operationalize
- Start with the software catalog (service ownership, API docs, dependencies) before adding self-service scaffolding or CI/CD visibility
- Portal adoption depends on making it the fastest way to get information — if developers still need to check Slack, Confluence, and PagerDuty separately, the portal fails
- Plugin development is the long game — the portal becomes valuable when teams contribute domain-specific plugins (cost dashboards, security scans, compliance status)
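The catalog described above is populated from descriptor files committed alongside each service. A minimal Backstage catalog-info.yaml Component entity might look like this (the names, owner, and annotation values are illustrative):

```yaml
# catalog-info.yaml -- minimal Component entity; names are illustrative
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing
  annotations:
    github.com/project-slug: example-org/payments-api
spec:
  type: service
  lifecycle: production
  owner: team-payments
  providesApis:
    - payments-api
```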
Common Mistakes with Developer Portal & Backstage
- Deploying Backstage out of the box without customization and expecting developers to adopt it — vanilla Backstage solves almost nothing
- Building a portal without populating the software catalog first — an empty catalog teaches developers that the portal is useless
- Treating the portal as an ops tool rather than a developer tool — the primary audience is application developers, not platform engineers
- Underestimating the maintenance burden of Backstage — upgrades, plugin compatibility, and catalog data freshness require ongoing investment
Related to Developer Portal & Backstage
Internal Developer Platform Design, Golden Paths & Paved Roads
Developer Productivity Metrics — Productivity & Metrics
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for Developer Productivity Metrics
- DORA metrics (deployment frequency, lead time, change failure rate, MTTR) measure delivery performance, not individual productivity
- SPACE framework (Satisfaction, Performance, Activity, Communication, Efficiency) provides a multidimensional view that resists gaming
- Never use metrics to compare individual developers — they are system-level indicators, not performance reviews
- Leading indicators (CI build time, PR review latency, environment provisioning time) are more actionable than lagging indicators (deployment frequency)
- Instrument automatically from your CI/CD and SCM systems — manual reporting introduces bias and overhead
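The automatic instrumentation point above can be sketched as a small aggregation over deployment records pulled from CI/CD and SCM APIs. The record shape (commit time, deploy time, incident flag) is an assumption for illustration:

```python
from datetime import datetime
from statistics import median

# Deployment records as (commit_time, deploy_time, caused_incident),
# as they might be pulled from CI/CD and SCM APIs.
deploys = [
    (datetime(2024, 3, 1, 9), datetime(2024, 3, 1, 15), False),
    (datetime(2024, 3, 2, 10), datetime(2024, 3, 3, 10), True),
    (datetime(2024, 3, 4, 8), datetime(2024, 3, 4, 11), False),
]

window_days = 7
deploy_frequency = len(deploys) / window_days            # deploys per day
median_lead_time = median(
    (d - c).total_seconds() for c, d, _ in deploys) / 3600  # hours
change_failure_rate = sum(1 for *_, bad in deploys if bad) / len(deploys)

print(f"{deploy_frequency:.2f}/day, "
      f"lead time {median_lead_time:.1f}h, "
      f"CFR {change_failure_rate:.0%}")
```

Everything here is aggregated at the system level -- no per-developer dimension exists in the data model, which is the point.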
Common Mistakes with Developer Productivity Metrics
- Measuring lines of code, commit count, or PRs merged as productivity proxies — these incentivize the wrong behaviors
- Publishing individual developer metrics on dashboards — this destroys psychological safety and encourages gaming
- Treating DORA metrics as targets rather than signals — Goodhart's Law applies aggressively to developer metrics
- Measuring only speed without quality — high deployment frequency with a high change failure rate is not productivity, it is chaos
Related to Developer Productivity Metrics
Platform Team Operating Model, Internal Developer Platform Design
Golden Paths & Paved Roads — Developer Platforms
Difficulty: Intermediate. Maturity Level: Foundation.
Key Points for Golden Paths & Paved Roads
- Golden paths make the right thing the easy thing — they are opinionated defaults, not enforced mandates
- Start with service templates (cookiecutters/scaffolds) that include CI/CD, observability, and security out of the box
- Guardrails over gates — use policy-as-code to warn and nudge rather than hard-blocking developers
- Measure adoption rate organically — if less than 80% of new services use the golden path, improve the path rather than forcing compliance
- Version your golden paths and support migration tooling so existing services can upgrade to new standards
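The guardrails-over-gates idea above can be sketched as a policy check that emits warnings instead of failing the build. The policy names and the manifest shape are illustrative, not any particular tool's schema:

```python
def check_policies(manifest: dict) -> list[str]:
    # Evaluate guardrail policies; return warnings rather than raising.
    # Policies here are examples, not a recommended baseline.
    warnings = []
    if not manifest.get("owner"):
        warnings.append("no owner set; on-call routing will not work")
    if manifest.get("replicas", 1) < 2:
        warnings.append("single replica; no zero-downtime deploys")
    if "resources" not in manifest:
        warnings.append("no resource requests; scheduler may overcommit")
    return warnings

manifest = {"name": "billing-api", "replicas": 1}
for w in check_policies(manifest):
    print(f"WARN: {w}")   # nudge the developer, do not hard-block
```

In production this logic would live in policy-as-code (OPA/Kyverno) running in audit mode; the key design choice is that a finding produces a visible warning, not a blocked pipeline.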
Common Mistakes with Golden Paths & Paved Roads
- Making golden paths mandatory without making them better than the alternative — developers will route around bad abstractions
- Creating a single golden path when your organization has fundamentally different workload types (APIs vs event processors vs ML pipelines)
- Neglecting Day 2 operations — a template that creates a service but does not help with debugging, scaling, or upgrading is only half the story
- Not involving application developers in designing the paths — platform engineers often optimize for infrastructure elegance rather than developer ergonomics
Related to Golden Paths & Paved Roads
Internal Developer Platform Design, Self-Service Infrastructure
Internal Developer Platform Design — Developer Platforms
Difficulty: Advanced. Maturity Level: Foundation.
Key Points for Internal Developer Platform Design
- An IDP is a product, not a project — it requires product management, user research, and iterative delivery
- Layer architecture: developer interface (portal/CLI), integration layer (APIs/webhooks), resource layer (infrastructure primitives)
- Build the thinnest possible platform that solves real friction — resist building abstractions nobody asked for
- Assemble from existing tools (Backstage, Argo, Crossplane) rather than building from scratch — composition over creation
- Measure success by developer adoption rate and time-to-production, not by number of features shipped
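The layer architecture above can be illustrated with a thin integration-layer function that expands a developer-facing request into resource-layer primitives. All defaults and field names are assumptions for the sketch:

```python
# Platform defaults per environment -- illustrative values only.
DEFAULTS = {
    "staging": {"size": "small", "backups": False},
    "production": {"size": "medium", "backups": True},
}

def provision_database(service: str, env: str) -> dict:
    # Developer asks for "a database for my service in <env>";
    # the platform fills in everything else.
    d = DEFAULTS[env]
    return {
        "engine": "postgres",
        "instance_class": d["size"],
        "backups_enabled": d["backups"],
        "tags": {"service": service, "env": env, "managed-by": "platform"},
    }

spec = provision_database("billing-api", "staging")
print(spec["instance_class"], spec["backups_enabled"])
```

The thinness shows in what the function does not do: it only maps intent to primitives, leaving the actual provisioning to existing tooling underneath.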
Common Mistakes with Internal Developer Platform Design
- Building an IDP as a top-down mandate without understanding actual developer pain points — always start with user research
- Over-abstracting infrastructure so developers lose visibility into what is actually running underneath
- Treating the platform as a one-time project rather than a continuously evolving product with a roadmap
- Forcing adoption through policy instead of making the platform genuinely easier than the alternative
Related to Internal Developer Platform Design
Golden Paths & Paved Roads, Developer Portal & Backstage
ML Platform & AI Golden Paths — Developer Platforms
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for ML Platform & AI Golden Paths
- ML golden paths cut experiment-to-production time from months to days by standardizing the boring parts: packaging, deployment, monitoring, and rollback
- The key design principle is abstracting infrastructure complexity while preserving experiment flexibility. Data scientists should never write Dockerfiles, but they should always control hyperparameters.
- Three capabilities are essential before anything else: a feature store for consistent feature computation, experiment tracking for reproducibility, and a model registry for versioning and lineage
- Self-service model deployment with built-in canary rollouts, A/B testing, and one-click rollback removes the biggest bottleneck in ML organizations
- Measure your ML platform by adoption rate and time-to-production. If data scientists are not using it, the platform is wrong, not the scientists.
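The model registry capability above can be sketched as a minimal store that records version, lineage, and metrics per registration. The field names are illustrative, not a specific registry product's API:

```python
import hashlib
from datetime import datetime, timezone

class ModelRegistry:
    # Minimal registry sketch: versioning plus lineage (data + code refs).
    def __init__(self):
        self._models = {}

    def register(self, name, artifact: bytes, data_ref, code_ref, metrics):
        versions = self._models.setdefault(name, [])
        entry = {
            "version": len(versions) + 1,
            "checksum": hashlib.sha256(artifact).hexdigest()[:12],
            "data_ref": data_ref,     # which dataset trained this version
            "code_ref": code_ref,     # which commit produced it
            "metrics": metrics,
            "registered_at": datetime.now(timezone.utc).isoformat(),
        }
        versions.append(entry)
        return entry

reg = ModelRegistry()
v1 = reg.register("churn-model", b"weights-v1", "s3://data/2024-03",
                  "git:abc123", {"auc": 0.91})
print(v1["version"], v1["checksum"])
```

The data_ref and code_ref fields are what make a registered model reproducible -- without them a registry is just artifact storage.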
Common Mistakes with ML Platform & AI Golden Paths
- Building an ML platform before you have proven ML use cases in production. You need at least 2-3 models running manually before you know what to automate.
- Over-abstracting the platform so data scientists cannot debug failures. When a training job fails, they need to see logs, not a generic 'pipeline error' message.
- Treating the ML platform as entirely separate from the application platform. Shared capabilities like CI/CD, secrets management, and observability should be reused, not rebuilt.
- Not involving data scientists in platform design. Engineers build what is elegant. Scientists need what is practical. These are often different things.
Related to ML Platform & AI Golden Paths
Golden Paths & Paved Roads, Internal Developer Platform Design
Platform Team Operating Model — Platform Operations
Difficulty: Expert. Maturity Level: Optimizing.
Key Points for Platform Team Operating Model
- A platform team is a product team — it needs a product manager, user research, a roadmap, and regular stakeholder communication
- Team Topologies defines the platform team as an enabling team that reduces cognitive load on stream-aligned (feature) teams
- Fund the platform team as a product investment, not as a cost center — tie funding to developer productivity outcomes, not headcount
- Define platform SLOs (provisioning latency, API availability, golden path adoption rate) and treat them with the same rigor as production service SLOs
- Run regular developer experience surveys and use the feedback to prioritize the roadmap — the platform team's customers are internal developers
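The platform SLO point above can be made concrete with a compliance check over provisioning records. The 15-minute target and 99% objective here are illustrative, not recommendations:

```python
# Provisioning durations in seconds, as logged by the platform.
provision_seconds = [240, 310, 95, 1800, 400, 120, 260, 180, 90, 300]

slo_target_s = 15 * 60     # each request should complete within 15 min...
slo_objective = 0.99       # ...at least 99% of the time

within = sum(1 for s in provision_seconds if s <= slo_target_s)
compliance = within / len(provision_seconds)
print(f"compliance {compliance:.0%}, "
      f"{'meeting' if compliance >= slo_objective else 'violating'} SLO")
```

Treating this with production rigor means alerting on error-budget burn, not eyeballing a dashboard once a quarter.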
Common Mistakes with Platform Team Operating Model
- Staffing the platform team with only infrastructure engineers — you need product managers, developer advocates, and frontend engineers too
- Building features nobody asked for because the platform team finds them technically interesting — validate demand before building
- Operating as a service team that takes requests rather than a product team that identifies and solves systemic problems
- Not having an explicit support model — developers need to know how to get help, what the response SLAs are, and where to file feature requests
Related to Platform Team Operating Model
Internal Developer Platform Design, Developer Productivity Metrics
Self-Service Infrastructure — Infrastructure Abstraction
Difficulty: Advanced. Maturity Level: Scaling.
Key Points for Self-Service Infrastructure
- Self-service means developers can provision, configure, and manage infrastructure without filing tickets or waiting for a platform team response
- Crossplane and Terraform modules are the two dominant approaches — Crossplane for Kubernetes-native declarative APIs, Terraform modules for broader cloud coverage
- Expose infrastructure as versioned APIs with sensible defaults — developers request 'a PostgreSQL database for my staging environment' not 'an RDS instance with specific VPC, subnet, and security group configurations'
- Guardrails are essential — use OPA/Kyverno policies to enforce cost limits, security baselines, and compliance requirements without manual approval gates
- Track provisioning time as a key metric — if it takes more than 15 minutes to get a new environment, you are not self-service yet
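The versioned-API point above maps directly onto the Crossplane claim model: the developer requests an abstract resource, and a composition fills in the cloud details. The API group, kind, and parameters in this sketch are hypothetical -- in practice they are defined by your own XRD:

```yaml
# Developer-facing claim in the Crossplane style; fields are hypothetical.
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: billing-db
  namespace: billing-staging
spec:
  parameters:
    size: small        # platform maps this to instance class, storage, VPC
  compositionSelector:
    matchLabels:
      environment: staging
```

The developer never sees subnets or security groups; the platform team versions the composition behind the claim and can change the mapping without touching any claim.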
Common Mistakes with Self-Service Infrastructure
- Giving developers raw Terraform with no abstraction — they should not need to understand VPC peering to get a database
- Building self-service without cost visibility — teams will over-provision if they cannot see the dollar impact of their infrastructure choices
- Skipping the cleanup problem — self-service provisioning without automated deprovisioning creates cloud waste that compounds monthly
- Making the self-service interface a web form that generates a ticket behind the scenes — this is self-service theater, not actual self-service
Related to Self-Service Infrastructure
Internal Developer Platform Design, Platform Team Operating Model