Cost Engineering & Cloud Economics
Costs Are Architecture Decisions
Most engineers treat cloud costs like electricity bills. They notice when the number is shocking and ignore it the rest of the time. Staff engineers treat cloud costs as an architecture input, right alongside latency, reliability, and security.
This distinction matters in interviews. When a Senior Staff candidate talks about designing a system, cost should appear naturally in their trade-off analysis, not as a separate afterthought section. If you chose DynamoDB over Postgres, you should be able to explain the cost model difference (per-request pricing vs provisioned compute) and why that model fits your access pattern.
Investigating a Cost Spike
A doubled cloud bill is a symptom, not a diagnosis. Here is the investigation sequence that experienced engineers follow.
Step 1: Attribution. Which service, team, or environment caused the increase? This requires tagging. If your resources are not tagged by team, service, and environment, you are already behind. Tools like AWS Cost Explorer, CloudHealth, or Vantage can slice spend by tag, but they are only as good as your tagging discipline.
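As a concrete starting point, here is a minimal sketch of tag-based attribution using boto3's Cost Explorer API. The tag key (`team`) and the date range are placeholders; substitute whichever cost allocation tags your organization has activated.

```python
import boto3

# Month-to-date spend grouped by a "team" cost allocation tag.
# Assumes credentials are configured and that "team" is an activated
# cost allocation tag in the account -- both are assumptions here.
ce = boto3.client("ce")  # AWS Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},  # placeholder range
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    team = group["Keys"][0]  # formatted as "team$<value>"
    cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{team}: ${cost:,.2f}")
```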
Step 2: Trend decomposition. Separate organic growth from waste. If your user base grew 80% but your bill grew 100%, only the 20-point gap is unexplained (assuming cost scales roughly linearly with users). Maybe someone left a load test environment running for three weeks. Maybe a logging pipeline started indexing debug-level logs in production. Each root cause has a different fix.
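To make the decomposition concrete, here is a back-of-the-envelope sketch using the numbers above. The linear-scaling assumption is a deliberate simplification; real workloads often scale sub- or super-linearly, so treat the output as a starting point for investigation, not an answer.

```python
def decompose_growth(old_bill: float, new_bill: float, user_growth: float):
    """Split a bill increase into growth-explained spend and an unexplained
    delta, assuming cost scales roughly linearly with users."""
    expected_bill = old_bill * (1 + user_growth)  # what growth alone predicts
    unexplained = new_bill - expected_bill        # candidate waste to investigate
    return expected_bill, unexplained

# Users +80%, bill +100%, on a hypothetical $1M starting bill:
expected, waste = decompose_growth(1_000_000, 2_000_000, 0.80)
print(f"Expected from growth: ${expected:,.0f}")  # $1,800,000
print(f"Unexplained delta:    ${waste:,.0f}")     # $200,000 -- dig here first
```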
Step 3: Architectural review. Some cost growth is structural. If your monolith was split into 40 microservices each running its own database, your RDS bill reflects that multiplication. Right-sizing will not fix this. It requires an architectural conversation about shared data layers or consolidation.
Coinbase reduced their cloud spend by 40% through architectural changes, not instance right-sizing. They consolidated redundant data pipelines, eliminated cross-region data transfer charges, and replaced expensive real-time processing with batch jobs where latency tolerance allowed it.
Making Teams Accountable Without Bureaucracy
The worst approach to cloud cost accountability is approval gates. Requiring a manager sign-off before provisioning a database guarantees two things: engineers will hate the process, and they will find creative ways around it.
What works instead is visibility. Airbnb built an internal cost attribution system that shows every team their spend in real time, broken down by service, displayed on dashboards alongside uptime and latency. When cost is visible and attributed, teams naturally ask "why is our staging environment costing $8K per month?" without anyone mandating reviews.
Pair visibility with guardrails, not gates. Set automated alerts when a team's spend exceeds its 30-day rolling average by more than 20%. Automatically shut down non-production resources outside business hours (this alone can save roughly 65% on dev/staging compute, since a 12-hour weekday schedule runs only about a third of the week). Use AWS Service Control Policies to block expensive instance types unless a team explicitly opts in.
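As one example of such a guardrail, here is a sketch of an off-hours shutdown job for tagged non-production instances. The `environment` tag values are assumptions about your tagging scheme; in practice this would run on a schedule (for example, an EventBridge cron rule invoking a Lambda).

```python
import boto3

ec2 = boto3.client("ec2")

def stop_non_prod_instances() -> list[str]:
    """Stop running EC2 instances tagged as dev/staging. Assumes an
    "environment" tag convention -- adjust the filter to your own scheme."""
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:environment", "Values": ["dev", "staging"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```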
The cultural piece matters as much as the tooling. Spotify tracks cost-per-stream as a key engineering metric alongside availability and latency. When cost becomes a first-class engineering metric rather than a finance team concern, optimization happens organically.
The Cost Conversation with Leadership
Engineers lose credibility with leadership when they present cloud costs as raw numbers. Saying "we need to reduce our AWS bill from $3M to $2M" invites the response "great, just cut a third of your infrastructure." That is not a productive conversation.
Unit economics reframe everything. "Our cost per active user decreased from $1.20 to $0.85 this year while we added 2M users" tells a story about efficiency, not expense. The total bill went up, but the business got more efficient. Cost per request, cost per active user, cost per transaction: these are the metrics that make infrastructure investment legible to executives.
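The arithmetic is trivial; what matters is reporting it alongside the raw bill. A sketch with hypothetical user counts chosen to match the shape of the story above:

```python
def cost_per_user(monthly_bill: float, active_users: int) -> float:
    return monthly_bill / active_users

# Hypothetical figures: the bill grows, but efficiency improves.
last_year = cost_per_user(3_600_000, 3_000_000)  # $1.20 per active user
this_year = cost_per_user(4_250_000, 5_000_000)  # $0.85, with 2M users added
print(f"Cost per active user: ${last_year:.2f} -> ${this_year:.2f}")
```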
When proposing infrastructure investments, frame them as business cases. "Migrating from self-hosted Elasticsearch to OpenSearch Serverless costs $40K more per year, but it frees up 0.5 FTE of SRE time currently spent on cluster management and incident response." Leaders respond to trade-offs expressed in terms they can evaluate.
Build vs Buy: The Real Math
The managed-vs-self-hosted question comes up in nearly every Senior Staff interview. The trap is treating it as a simple cost comparison.
Total cost of ownership for self-hosted infrastructure includes compute, storage, networking, engineering time, on-call burden, upgrade cycles, and the opportunity cost of what those engineers could build instead. For a self-hosted Kafka cluster, the compute might be $60K per year, but the fully-loaded engineering cost adds another $100-200K.
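A sketch of that comparison, using the Kafka numbers above. The FTE fractions and fully-loaded salary are illustrative assumptions, and they are exactly the inputs worth pressure-testing in a real decision.

```python
def total_cost_of_ownership(compute: float, engineer_fte: float,
                            fully_loaded_salary: float) -> float:
    """Annual TCO: infrastructure plus engineering time spent on
    operations, on-call, upgrades, and incident response."""
    return compute + engineer_fte * fully_loaded_salary

# Illustrative inputs: 0.75 FTE to run Kafka yourself, 0.1 FTE to babysit
# the managed service, $200K fully-loaded cost per engineer.
self_hosted = total_cost_of_ownership(60_000, 0.75, 200_000)   # $210,000
managed = total_cost_of_ownership(180_000, 0.10, 200_000)      # $200,000
print(f"Self-hosted TCO: ${self_hosted:,.0f}")
print(f"Managed TCO:     ${managed:,.0f}")
```

On these assumptions the $180K sticker price is not the expensive option, which is the whole point of the exercise.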
The decision framework: self-host when the component is core to your competitive advantage and you need deep customization (LinkedIn runs their own Kafka because they invented it). Buy managed services for everything else.
Sample Questions
Your cloud bill doubled in 6 months. Walk me through how you would investigate, identify root causes, and build a plan to bring costs under control.
This question reveals whether you approach cost as an engineering problem or a management complaint. Strong answers combine observability (tagging, attribution) with architectural thinking (is the cost growth structural or accidental?).
How do you make engineering teams accountable for cloud costs without slowing them down or creating bureaucratic approval processes?
Cost accountability is an organizational design problem disguised as a finance problem. Interviewers want to hear about incentive structures, visibility tooling, and cultural approaches that work at scale.
Your team needs a managed Kafka cluster. The managed offering costs $180K/year. Self-hosting on EC2 would cost roughly $60K/year in compute. How do you make this build-vs-buy decision?
The obvious answer is 'buy because operational cost is hidden.' But the real answer depends on team capability, scale trajectory, and how critical the component is. Interviewers want a structured decision framework, not a reflexive answer.
Evaluation Criteria
- Demonstrates a systematic approach to cost investigation: tagging, attribution, trend analysis, then architectural review
- Connects cloud cost decisions to business metrics like unit economics, not just raw spend reduction
- Shows understanding of the full cost picture including engineering time, operational burden, and opportunity cost
- Proposes accountability mechanisms that balance visibility with developer autonomy
- Uses concrete numbers and real-world examples rather than abstract principles
Key Points
- Without resource tagging, you cannot attribute costs. And without cost attribution, every optimization conversation devolves into finger-pointing. Tagging strategy is not a nice-to-have. It is the foundation of cloud cost management.
- Reserved Instances save roughly 40% for 1-year and 60% for 3-year commitments, but they lock you into specific instance families. Savings Plans offer similar discounts with more flexibility across instance types. The right mix depends on how stable your workload profile is. Spotify runs 80% of their base compute on commitments and uses on-demand for burst capacity; the blended math is sketched after this list.
- Unit economics change the entire cost conversation. 'We spent $2.3M on AWS last month' is alarming. 'Our cost per transaction dropped from $0.0043 to $0.0031 while transactions grew 60%' is a success story. Same data, different framing.
- 90% of cloud instances are over-provisioned, according to both AWS and Datadog's annual reports. Right-sizing is the single highest-ROI cost optimization, and it requires zero architectural changes.
- Build-vs-buy decisions that only compare sticker price are wrong by default. A $60K self-hosted Kafka cluster costs $60K in compute plus $150K in engineering time for operations, on-call, upgrades, and incident response. The $180K managed service suddenly looks cheap.
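To make the commitment-mix arithmetic from the Reserved Instances point concrete, here is the blended-discount sketch referenced above. The $1M on-demand baseline is hypothetical; the discount rate is the rough 1-year figure from the bullet.

```python
def blended_cost(on_demand_annual: float, commitment_share: float,
                 commitment_discount: float) -> float:
    """Annual compute cost with part of the baseline on commitments
    (Reserved Instances or Savings Plans) and the rest on demand."""
    committed = on_demand_annual * commitment_share * (1 - commitment_discount)
    on_demand = on_demand_annual * (1 - commitment_share)
    return committed + on_demand

# 80% of a hypothetical $1M baseline on 1-year commitments at ~40% off:
total = blended_cost(1_000_000, 0.80, 0.40)
print(f"Blended annual cost: ${total:,.0f}")  # $680,000 -- a 32% overall saving
```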
Common Mistakes
- ✗ Treating cost optimization as a one-time project instead of a continuous practice. Costs drift back up within months without ongoing governance, automated alerts, and regular review cadences.
- ✗ Optimizing for raw cost reduction instead of cost efficiency. Cutting your cloud bill by 30% means nothing if you also cut your capacity to handle traffic spikes, and the resulting outage costs you more than you saved.
- ✗ Ignoring engineering time in build-vs-buy calculations. Two senior engineers spending 20% of their time operating a self-hosted database is $120K+ per year in fully-loaded salary. That number never appears on the cloud bill, but it is real.
- ✗ Proposing cost controls that require approval workflows for resource provisioning. If spinning up a staging environment needs a ticket, you have traded cloud dollars for engineering hours at a terrible exchange rate.