Cloud Cost Optimization
Where the Money Actually Goes
Cloud cost surprises happen because engineers spin up resources once and then forget about them. A typical AWS bill breaks down something like this: compute 60-70% (EC2, ECS, Lambda), storage 15-20% (S3, EBS, RDS storage), data transfer 5-15% (cross-AZ, cross-region, internet egress), and managed services 5-10% (RDS, ElastiCache, OpenSearch). Your first step should be understanding your own breakdown, because it might look very different from this.
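You can pull that breakdown yourself with the Cost Explorer API. Here's a minimal sketch using boto3, assuming credentials that can call ce:GetCostAndUsage; the date range and the $1 cutoff are placeholders to adjust:

```python
# Sketch: last month's spend grouped by service, via Cost Explorer.
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # CE is served from us-east-1

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 1:  # skip the pennies
        print(f"{service:45s} ${amount:,.2f}")
```

Run it once a month and you'll quickly see whether your bill matches the stereotypical shape above.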
Right-Sizing: The Highest Leverage
Right-sizing is about matching instance types and sizes to what your workloads actually need. AWS Cost Explorer, Google Cloud Recommender, and third-party tools like Datadog or CloudHealth can analyze your utilization patterns and recommend downsizing. The typical finding is that 30-40% of instances can drop at least one size without any performance hit.
Run the analysis on 14 days of production data. Look at P95 CPU and memory utilization, not averages. Averages hide spikes that would bite you after downsizing. Start with non-production environments where the stakes are lower. A t3.xlarge sitting at 15% CPU is just burning money. A t3.medium would handle that load without breaking a sweat.
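CloudWatch exposes percentile statistics directly, so the P95 check doesn't require a third-party tool. A minimal sketch for a single instance, assuming boto3 and a placeholder instance ID; note that EC2 only reports CPU out of the box, so memory utilization needs the CloudWatch agent installed:

```python
# Sketch: hourly P95 CPU for one instance over the last 14 days.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(days=14),
    EndTime=datetime.now(timezone.utc),
    Period=3600,  # one datapoint per hour
    ExtendedStatistics=["p95"],
)

p95_values = [dp["ExtendedStatistics"]["p95"] for dp in resp["Datapoints"]]
if p95_values:
    print(f"Worst hourly P95 CPU over 14 days: {max(p95_values):.1f}%")
```

If the worst hourly P95 stays comfortably below 50%, the instance is a downsizing candidate.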
Reserved Instances and Savings Plans
Once you've right-sized, commit to what you know you'll keep using. AWS Reserved Instances (RIs) and Savings Plans save 30-40% on 1-year commitments and 50-60% on 3-year commitments. The tradeoff is flexibility: if your architecture shifts, you might end up paying for instance types that no longer make sense.
Start conservatively. Cover your baseline steady-state compute with reservations and handle burst traffic with on-demand pricing. A common approach is to reserve 60-70% of your compute footprint and use on-demand or spot for the rest.
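The arithmetic behind that split is worth running against your own fleet. A back-of-envelope sketch; the hourly rate, discount, and fleet numbers are all illustrative assumptions, not quotes:

```python
# Back-of-envelope: reserving the baseline vs. running everything on-demand.
HOURS_PER_MONTH = 730
on_demand_rate = 0.192   # $/hr, illustrative (roughly an m5.xlarge in us-east-1)
ri_discount = 0.40       # ~40% off for a 1-year commitment (assumption)

fleet_size = 100         # instances running at peak
baseline = 65            # steady-state floor -> reservation candidates

reserved = baseline * on_demand_rate * (1 - ri_discount) * HOURS_PER_MONTH
burst = (fleet_size - baseline) * on_demand_rate * HOURS_PER_MONTH
all_on_demand = fleet_size * on_demand_rate * HOURS_PER_MONTH

print(f"All on-demand:         ${all_on_demand:,.0f}/mo")
print(f"65% reserved + burst:  ${reserved + burst:,.0f}/mo")
print(f"Monthly savings:       ${all_on_demand - reserved - burst:,.0f}")
```

With these numbers, reserving 65 of 100 instances saves roughly $3,600/month, and the exposure if the architecture shifts is limited to the baseline you committed.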
Spot Instances for Fault-Tolerant Workloads
Spot instances (AWS) or Spot VMs (GCP, formerly preemptible VMs) give you 60-90% discounts, but the cloud provider can reclaim them on short notice: AWS sends a two-minute interruption warning, GCP only 30 seconds. This works great for CI/CD runners, batch data processing, stateless web workers behind auto-scaling groups, and dev/test environments.
The key is designing for interruption. Use checkpointing for long-running jobs. Spread across multiple instance types and availability zones. Set up instance fleet configurations that automatically fall back to other types when your preferred spot capacity dries up.
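On AWS, the interruption warning is exposed through the instance metadata service, so a worker can watch for it and checkpoint before the reclaim. A minimal sketch, assuming IMDSv1-style access (IMDSv2 requires fetching a session token first) and a hypothetical save_checkpoint() standing in for your own persistence logic:

```python
# Sketch: poll for a spot interruption notice and checkpoint before reclaim.
import time
import requests

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def save_checkpoint():
    """Placeholder: flush in-flight state to S3, drain the work queue, etc."""

while True:
    # Returns 404 until AWS schedules an interruption, then JSON with
    # an "action" (stop/terminate) and a "time".
    resp = requests.get(METADATA_URL, timeout=2)
    if resp.status_code == 200:
        notice = resp.json()
        print(f"Interruption notice: {notice['action']} at {notice['time']}")
        save_checkpoint()
        break
    time.sleep(5)  # poll well inside the two-minute warning window
```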
Storage Lifecycle and Data Transfer
Storage costs have a way of creeping up on you. S3 Intelligent-Tiering handles access tier transitions automatically. If you know your access patterns, set up explicit lifecycle policies: move to Infrequent Access after 30 days, Glacier after 90 days, Deep Archive after 365 days. A single lifecycle policy across your data lake can knock storage costs down by 40-80%.
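That 30/90/365 schedule translates directly into a lifecycle configuration. A sketch with boto3; the bucket name and the raw/ prefix are placeholders, and the caller needs s3:PutLifecycleConfiguration:

```python
# Sketch: the 30/90/365-day tiering above as an S3 lifecycle policy.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},  # apply only under this prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```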
Data transfer is the hidden tax of cloud computing. Cross-AZ traffic runs $0.01/GB on AWS, charged on both the sending and receiving side, which seems tiny per request but adds up fast at scale. Architecture decisions that reduce cross-AZ chattiness (co-locating dependent services, using AZ-aware routing) can save thousands per month. Cross-region replication for disaster recovery is necessary but expensive, so replicate selectively rather than mirroring everything.
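The arithmetic is worth spelling out, since both sides of the transfer are billed. Traffic volume here is an illustrative assumption:

```python
# Back-of-envelope: one chatty service pair exchanging 5 TB/day across AZs.
GB_PER_DAY = 5_000
RATE_EACH_WAY = 0.01  # $/GB, billed to both the sending and receiving side

monthly = GB_PER_DAY * 30 * RATE_EACH_WAY * 2
print(f"Cross-AZ bill: ${monthly:,.0f}/month")  # -> $3,000/month
```

One service pair at that volume is $3,000/month; a handful of them is a salary.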
Key Points
- Compute typically accounts for 60-70% of cloud spend. Right-sizing instances is the highest-leverage optimization
- Reserved Instances and Savings Plans can reduce compute costs by 30-60% with 1-3 year commitments
- Spot/preemptible instances offer 60-90% discounts for fault-tolerant workloads like batch processing and CI/CD
- Storage lifecycle policies automatically move infrequently accessed data to cheaper tiers, saving 40-80%
- Cost allocation tags are foundational. You cannot optimize what you cannot attribute to a team or service (see the sketch after this list)
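As a concrete example of attribution, Cost Explorer can group spend by a cost allocation tag. A sketch assuming a hypothetical team tag that has already been activated in the billing console:

```python
# Sketch: last month's spend attributed by a "team" cost allocation tag.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag = group["Keys"][0]  # returned as "team$<value>"; bare "team$" means untagged
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag or 'UNTAGGED':30s} ${amount:,.2f}")
```

The size of the untagged bucket is your attribution gap, and driving it toward zero is usually the first project in any cost program.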
Common Mistakes
- Over-provisioning resources out of fear. Most production instances run at 10-20% CPU utilization
- Buying Reserved Instances before right-sizing, locking in waste for 1-3 years
- Ignoring data transfer costs, which can silently become 10-15% of your total bill
- Treating cost optimization as a one-time project instead of an ongoing practice