Cloud Cost Optimization
Where the Money Actually Goes
Cloud cost surprises happen because engineers spin up resources once and then forget about them. A typical AWS bill breaks down something like this: compute 60-70% (EC2, ECS, Lambda), storage 15-20% (S3, EBS, RDS storage), data transfer 5-15% (cross-AZ, cross-region, internet egress), and managed services 5-10% (RDS, ElastiCache, OpenSearch). Your first step should be understanding your own breakdown, because it might look very different from this.
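You can pull that breakdown yourself with the Cost Explorer API. Here's a minimal sketch using boto3, assuming credentials that can call ce:GetCostAndUsage; the date range and the $1 cutoff are placeholders to adjust:

```python
# Sketch: last month's spend grouped by service, via Cost Explorer.
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # CE is served from us-east-1

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 1:  # skip the pennies
        print(f"{service:45s} ${amount:,.2f}")
```

Run it once a month and you'll quickly see whether your bill matches the stereotypical shape above.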
Right-Sizing: The Highest Leverage
Right-sizing is about matching instance types and sizes to what your workloads actually need. AWS Cost Explorer, Google Cloud Recommender, and third-party tools like Datadog or CloudHealth can analyze your utilization patterns and recommend downsizing. The typical finding is that 30-40% of instances can drop at least one size without any performance hit.
Run the analysis on 14 days of production data. Look at P95 CPU and memory utilization, not averages. Averages hide spikes that would bite you after downsizing. Start with non-production environments where the stakes are lower. A t3.xlarge sitting at 15% CPU is just burning money. A t3.medium would handle that load without breaking a sweat.
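CloudWatch exposes percentile statistics directly, so the P95 check doesn't require a third-party tool. A minimal sketch for a single instance, assuming boto3 and a placeholder instance ID; note that EC2 only reports CPU out of the box, so memory utilization needs the CloudWatch agent installed:

```python
# Sketch: hourly P95 CPU for one instance over the last 14 days.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(days=14),
    EndTime=datetime.now(timezone.utc),
    Period=3600,  # one datapoint per hour
    ExtendedStatistics=["p95"],
)

p95_values = [dp["ExtendedStatistics"]["p95"] for dp in resp["Datapoints"]]
if p95_values:
    print(f"Worst hourly P95 CPU over 14 days: {max(p95_values):.1f}%")
```

If the worst hourly P95 stays comfortably below 50%, the instance is a downsizing candidate.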
Reserved Instances and Savings Plans
Once you've right-sized, commit to what you know you'll keep using. AWS Reserved Instances (RIs) and Savings Plans save 30-40% on 1-year commitments and 50-60% on 3-year commitments. The tradeoff is flexibility: if your architecture shifts, you might end up paying for instance types that no longer make sense.
Start conservatively. Cover your baseline steady-state compute with reservations and handle burst traffic with on-demand pricing. A common approach is to reserve 60-70% of your compute footprint and use on-demand or spot for the rest.
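The arithmetic behind that split is worth running against your own fleet. A back-of-envelope sketch; the hourly rate, discount, and fleet numbers are all illustrative assumptions, not quotes:

```python
# Back-of-envelope: reserving the baseline vs. running everything on-demand.
HOURS_PER_MONTH = 730
on_demand_rate = 0.192   # $/hr, illustrative (roughly an m5.xlarge in us-east-1)
ri_discount = 0.40       # ~40% off for a 1-year commitment (assumption)

fleet_size = 100         # instances running at peak
baseline = 65            # steady-state floor -> reservation candidates

reserved = baseline * on_demand_rate * (1 - ri_discount) * HOURS_PER_MONTH
burst = (fleet_size - baseline) * on_demand_rate * HOURS_PER_MONTH
all_on_demand = fleet_size * on_demand_rate * HOURS_PER_MONTH

print(f"All on-demand:         ${all_on_demand:,.0f}/mo")
print(f"65% reserved + burst:  ${reserved + burst:,.0f}/mo")
print(f"Monthly savings:       ${all_on_demand - reserved - burst:,.0f}")
```

With these numbers, reserving 65 of 100 instances saves roughly $3,600/month, and the exposure if the architecture shifts is limited to the baseline you committed.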
Spot Instances for Fault-Tolerant Workloads
Spot instances (AWS) or Spot VMs (GCP, formerly preemptible VMs) give you 60-90% discounts, but the cloud provider can reclaim them on short notice: AWS sends a two-minute interruption warning, GCP only 30 seconds. This works great for CI/CD runners, batch data processing, stateless web workers behind auto-scaling groups, and dev/test environments.
The key is designing for interruption. Use checkpointing for long-running jobs. Spread across multiple instance types and availability zones. Set up instance fleet configurations that automatically fall back to other types when your preferred spot capacity dries up.
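On AWS, the interruption warning is exposed through the instance metadata service, so a worker can watch for it and checkpoint before the reclaim. A minimal sketch, assuming IMDSv1-style access (IMDSv2 requires fetching a session token first) and a hypothetical save_checkpoint() standing in for your own persistence logic:

```python
# Sketch: poll for a spot interruption notice and checkpoint before reclaim.
import time
import requests

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def save_checkpoint():
    """Placeholder: flush in-flight state to S3, drain the work queue, etc."""

while True:
    # Returns 404 until AWS schedules an interruption, then JSON with
    # an "action" (stop/terminate) and a "time".
    resp = requests.get(METADATA_URL, timeout=2)
    if resp.status_code == 200:
        notice = resp.json()
        print(f"Interruption notice: {notice['action']} at {notice['time']}")
        save_checkpoint()
        break
    time.sleep(5)  # poll well inside the two-minute warning window
```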
Storage Lifecycle and Data Transfer
Storage costs have a way of creeping up on you. S3 Intelligent-Tiering handles access tier transitions automatically. If you know your access patterns, set up explicit lifecycle policies: move to Infrequent Access after 30 days, Glacier after 90 days, Deep Archive after 365 days. A single lifecycle policy across your data lake can knock storage costs down by 40-80%.
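That 30/90/365 schedule translates directly into a lifecycle configuration. A sketch with boto3; the bucket name and the raw/ prefix are placeholders, and the caller needs s3:PutLifecycleConfiguration:

```python
# Sketch: the 30/90/365-day tiering above as an S3 lifecycle policy.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},  # apply only under this prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```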
Data transfer is the hidden tax of cloud computing. Cross-AZ traffic runs $0.01/GB on AWS, charged on both the sending and receiving side, which seems tiny per request but adds up fast at scale. Architecture decisions that reduce cross-AZ chattiness (co-locating dependent services, using AZ-aware routing) can save thousands per month. Cross-region replication for disaster recovery is necessary but expensive, so replicate selectively rather than mirroring everything.
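The arithmetic is worth spelling out, since both sides of the transfer are billed. Traffic volume here is an illustrative assumption:

```python
# Back-of-envelope: one chatty service pair exchanging 5 TB/day across AZs.
GB_PER_DAY = 5_000
RATE_EACH_WAY = 0.01  # $/GB, billed to both the sending and receiving side

monthly = GB_PER_DAY * 30 * RATE_EACH_WAY * 2
print(f"Cross-AZ bill: ${monthly:,.0f}/month")  # -> $3,000/month
```

One service pair at that volume is $3,000/month; a handful of them is a salary.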
Key Points
- Compute typically accounts for 60-70% of cloud spend. Right-sizing instances is the highest-leverage optimization
- Reserved Instances and Savings Plans can reduce compute costs by 30-60% with 1-3 year commitments
- Spot/preemptible instances offer 60-90% discounts for fault-tolerant workloads like batch processing and CI/CD
- Storage lifecycle policies automatically move infrequently accessed data to cheaper tiers, saving 40-80%
- Cost allocation tags are foundational. You cannot optimize what you cannot attribute to a team or service (see the sketch after this list)
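As a concrete example of attribution, Cost Explorer can group spend by a cost allocation tag. A sketch assuming a hypothetical team tag that has already been activated in the billing console:

```python
# Sketch: last month's spend attributed by a "team" cost allocation tag.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag = group["Keys"][0]  # returned as "team$<value>"; bare "team$" means untagged
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag or 'UNTAGGED':30s} ${amount:,.2f}")
```

The size of the untagged bucket is your attribution gap, and driving it toward zero is usually the first project in any cost program.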
Common Mistakes
- Over-provisioning resources out of fear. Most production instances run at 10-20% CPU utilization
- Buying Reserved Instances before right-sizing, locking in waste for 1-3 years
- Ignoring data transfer costs, which can silently become 10-15% of your total bill
- Treating cost optimization as a one-time project instead of an ongoing practice