This article outlines a high-level, repeatable strategy for saving money on the cloud. We developed this strategy as part of our cost-saving product, Pyrae Policy Engine, which regularly saves our customers 25%+ overall on their AWS bills. These methods can be applied repeatedly to any AWS service, or even to other cloud providers, based on the needs of your organization.
Pattern: Outdated
For example: "Previous Generation" EC2 instance types, Migrate from GP2 to GP3, + more Outdated policies
Comparing outdated instances to current ones is not always apples-to-apples, but they typically offer worse price-performance than newer instance types. A newer instance type can usually handle more load at a similar or lower price than the old instance, so the same workload can run at a smaller scale, which reduces AWS spend.
An unnamed company saved $18,890/month by migrating from GP2 to GP3
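As a minimal sketch of how the GP2-to-GP3 portion might be automated with boto3 (the blanket "migrate everything" approach below is an assumption for illustration, not Pyrae's implementation):

```python
import boto3

ec2 = boto3.client("ec2")

# Find every GP2 volume and request an in-place migration to GP3.
# GP3's baseline (3,000 IOPS / 125 MiB/s) covers most GP2 workloads,
# but verify IOPS/throughput needs for busy volumes before migrating.
paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(Filters=[{"Name": "volume-type", "Values": ["gp2"]}]):
    for volume in page["Volumes"]:
        print(f"Migrating {volume['VolumeId']} ({volume['Size']} GiB) to gp3")
        ec2.modify_volume(VolumeId=volume["VolumeId"], VolumeType="gp3")
```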
Pattern: Dangling
For example: Dangling EBS volumes, VPC endpoints attached to a VPC with no nodes, + more Dangling policies
Dangling resources are left over after another resource's deletion, but the leftover resource itself is still billable. For example, after you delete an EC2 instance, you may decide not to delete its EBS volume immediately, in case you need to roll back the deletion or want to archive the data first. Sometimes you may not even know you're creating a dangling resource: if you delete the last EC2 instance in a VPC but the VPC still has a VPC endpoint attached, that endpoint will continue to be billed for no reason. Whether the resource was forgotten or deliberately left behind, identifying and deleting these dangling resources saves money.
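For instance, a dangling EBS volume is simply one in the "available" state (attached to nothing). A hedged sketch of how you might list them:

```python
import boto3

ec2 = boto3.client("ec2")

# Volumes in the "available" state are not attached to any instance and
# are still billed per GB-month; review them before deleting or snapshotting.
paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
    for volume in page["Volumes"]:
        print(f"{volume['VolumeId']}: {volume['Size']} GiB, created {volume['CreateTime']:%Y-%m-%d}")
```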
Pattern: ARM/Graviton
For example: EC2/RDS/EKS/ECS should leverage M7g/R7g, + more ARM policies
AWS offers instances using ARM processors, which offer 27% better price-performance than comparable Intel processor-based instances. Migrating an application to ARM does require engineering effort, but for most applications the migration is relatively seamless, particularly those written in interpreted languages (Node, Python, etc.).
Most AWS services that let you choose an instance type offer ARM-based options, including EC2, RDS, SageMaker, and ElastiCache. For RDS in particular, migrating to ARM is relatively low risk and low effort, requiring only an instance class change (and possibly a minor engine version upgrade), so it is a good place to start.
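As a sketch, moving an RDS instance to a Graviton class (assuming the engine version supports it; the identifier and class below are hypothetical) is a single modification:

```python
import boto3

rds = boto3.client("rds")

# Switch an Intel-based db.m5.large to the Graviton-based db.m7g.large.
# Like any instance class change, this incurs a brief restart/failover,
# so let it apply during the maintenance window unless it's needed sooner.
rds.modify_db_instance(
    DBInstanceIdentifier="orders-db",
    DBInstanceClass="db.m7g.large",
    ApplyImmediately=False,  # wait for the next maintenance window
)
```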
By leveraging Graviton2 or Graviton3, Wealthfront saw savings of 20%, Zendesk saw savings of 42%, Formula 1 saw savings of 30% and anticipates 40%+ is possible, and Instructure saw savings of 15-20%.
Pattern: AMD
For example: EC2 M7a
AWS also offers instances using AMD processors. Although AWS is a bit cagey about the price-performance savings (often comparing against a "previous generation" Intel instance at the same time), the AMD instances are roughly 10% cheaper than same-generation Intel processor-based instances. If an application cannot be migrated to ARM because it depends on the x86_64 instruction set or would be excessively costly to port, the AMD instance type offering is a good fit.
Blackboard was able to save 28% overall by switching to AMD among other cost saving efforts.
Pattern: Oversized
For example: RDS low CPU utilization, EBS volume excessive provisioned IOPS, Excessive Lambda Provisioned Concurrency, Low RCU/WCU DynamoDB Table should be using Infrequent Access Tier, + more Oversized policies
Oversized resources are provisioned beyond what their workload actually uses. This may happen because load on a service decreases, a new AWS pricing tier is released after the initial infrastructure is provisioned, or an imprecise usage estimate is never revisited. This is a very broad category of fix. While we spin off some common special cases into their own patterns in this list, this category tends to contain much of the savings impact for organizations just starting their cost-saving journey.
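A hedged example of how oversized RDS instances might be surfaced by CPU utilization (the 10% threshold and 14-day window are arbitrary assumptions, not a recommendation):

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
rds = boto3.client("rds")
now = datetime.now(timezone.utc)

# Flag instances whose daily average CPU never exceeded 10% in the last 14 days.
for db in rds.describe_db_instances()["DBInstances"]:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db["DBInstanceIdentifier"]}],
        StartTime=now - timedelta(days=14),
        EndTime=now,
        Period=86400,          # one datapoint per day
        Statistics=["Average"],
    )
    averages = [p["Average"] for p in stats["Datapoints"]]
    if averages and max(averages) < 10:
        print(f"{db['DBInstanceIdentifier']} ({db['DBInstanceClass']}): candidate for downsizing")
```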
Pattern: Retention
For example: Excessive EBS backups, unlimited S3 file retention, unlimited CloudWatch Log group retention, + more Retention policies
Retention misconfigurations occur when an object is kept for an excessively long period of time, or with excessive granularity over that period. Do you really need hourly backups from three months ago, or would weekly be sufficient? Do you need service logs from four years ago? This can be remedied by deleting the resource, configuring a more conservative retention policy, or migrating the data to S3 (preferably in an archive tier).
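Two common one-call fixes, sketched with hypothetical names and retention periods: capping a CloudWatch Logs group's retention, and adding an S3 lifecycle rule that archives old objects and eventually expires them.

```python
import boto3

logs = boto3.client("logs")
s3 = boto3.client("s3")

# Keep application logs for 90 days instead of forever.
logs.put_retention_policy(logGroupName="/app/service-logs", retentionInDays=90)

# Move objects to Glacier Deep Archive after 90 days and expire them after 3 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
                "Expiration": {"Days": 1095},
            }
        ]
    },
)
```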
Pattern: Reservations
For example: Reservations are offered for EC2, RDS, ElastiCache, OpenSearch, Redshift, and DynamoDB, + more Reservation policies
When you're confident that an instance will continue to be used for at least the next year, AWS offers a discount (up to 72%, typically closer to 30%) in exchange for committing to use the resource for 12 or 36 months, optionally paying some fraction of that cost up front. You can make this commitment at the executive level or, preferably, direct your teams to make the commitments individually, since they have the best understanding of their infrastructure. For organizations just starting their cost-saving journey AND running traditional instance-based infrastructure, this category is one of the most significant opportunities for savings.
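AWS Cost Explorer can generate reservation purchase recommendations from your recent usage. A minimal sketch (the service, term, and payment option below are assumptions; adjust to your situation):

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Ask for 1-year, no-upfront RDS reservation recommendations based on
# the last 30 days of usage.
response = ce.get_reservation_purchase_recommendation(
    Service="Amazon Relational Database Service",
    LookbackPeriodInDays="THIRTY_DAYS",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
)
for recommendation in response["Recommendations"]:
    for detail in recommendation["RecommendationDetails"]:
        print(detail["InstanceDetails"], detail["EstimatedMonthlySavingsAmount"])
```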
Pattern: Savings Plans
For Example: Savings plans are offered for EC2 (any instance type), EC2 (specific instance types), Fargate, Lambda, and SageMaker
When you're confident that you will continue to spend at least $X/hour on a supported AWS service, Savings Plans are a commitment to spend that $X every hour on that service, even if you consume less than the commitment. In exchange, you receive a discounted rate on related usage up to the committed $X/hour. Even though Savings Plans offer quite good rates on paper ("up to 72%"), most businesses have cyclic usage patterns with daytime peaks and nighttime troughs, which limits how much you can safely commit.
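A simplified worked example of why the trough limits the commitment, using hypothetical spend numbers and an assumed flat 30% discount (real Savings Plans rates vary by service, term, and payment option):

```python
# Hypothetical usage pattern: $100/hour of Fargate during a 12-hour daytime
# peak, $40/hour overnight. Commit only to the trough so the commitment is
# never wasted; assume a 30% discount on covered spend for illustration.
peak_spend, trough_spend, discount = 100.0, 40.0, 0.30

commitment = trough_spend                    # $/hour committed, fully used all 24 hours
covered = commitment * 24                    # on-demand-equivalent spend covered each day
uncovered = (peak_spend - commitment) * 12   # daytime spend still billed on-demand

daily_savings = covered * discount
daily_on_demand = peak_spend * 12 + trough_spend * 12
print(f"covered: ${covered:.0f}/day, still on-demand: ${uncovered:.0f}/day")
print(f"savings: ${daily_savings:.0f}/day (~{daily_savings / daily_on_demand:.0%} overall)")
# The headline "up to 72%" shrinks to roughly 17% overall in this scenario.
```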
Pattern: Autoscaling
For Example: ECS missing autoscaling policy, EC2 ASG should use Attribute Based Instance Selection, Redshift missing Pause/Resume policy, + more Autoscaling policies
Compared to on-prem, one of the key differentiators of the cloud is the ability to scale services dynamically. When you fail to configure autoscaling, you're effectively paying for services to sit there unused. There are three main cost-related ways that an autoscaling policy can be misconfigured:
- The autoscaling policy is missing entirely (a minimal sketch follows this list).
- The autoscaling policy is not utilizing spot instances.
- The autoscaling policy is not using attribute-based instance selection.
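For the first case, a minimal target-tracking sketch for an ECS service (the cluster, service, capacity limits, and CPU target below are hypothetical):

```python
import boto3

aas = boto3.client("application-autoscaling")

# Register the ECS service as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Target-tracking policy: keep average CPU around 60%.
aas.put_scaling_policy(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```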
Spot instances offer 30-90% savings over On-Demand. In this case study, ITV saved an average of 60% per instance and was able to run 24% of its compute instances on Spot, for a 14.4% overall average savings. ASGs can leverage Spot instances automatically: Amazon spins up the cheapest compatible Spot instance, and when that instance becomes unavailable, it spins up the next-cheapest one. Most applications that can boot in a few minutes, particularly stateless applications, will have no issue running on Spot.
Attribute-based instance type selection enables Amazon to pick any instance type that meets specified criteria (CPU, memory, network, etc.). When Amazon releases new instance types that are cheaper, your instances will automatically start using them. By leveraging this strategy, Druva was "able to bring down compute costs for its EC2 Spot usage by 10-15%."
Interestingly, attribute-based instance type selection and Spot instances can be combined for compound savings: Amazon will pick the cheapest qualifying Spot instance, which sometimes is not even the smallest instance available.
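A hedged sketch of an Auto Scaling group that combines both, a Spot-heavy allocation with attribute-based instance selection (the group name, launch template, subnets, and requirement ranges are all hypothetical):

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-workers",
    MinSize=2,
    MaxSize=20,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-workers",
                "Version": "$Latest",
            },
            # Attribute-based selection: any instance type with 4-8 vCPUs
            # and at least 8 GiB of memory qualifies.
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {"Min": 4, "Max": 8},
                        "MemoryMiB": {"Min": 8192},
                    }
                }
            ],
        },
        "InstancesDistribution": {
            # Keep a small On-Demand baseline; everything above it runs on Spot.
            "OnDemandBaseCapacity": 2,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    },
)
```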
Pattern: Business Hours
For Example: Suspending developer instances, Downscaling in the dev environment at night, etc.
When everybody's asleep, the dev cluster doesn't need a node in three AZs; it doesn't need that reliability or capacity at those hours. So, you can save money by aggressively downscaling. Note that this is very difficult to leverage outside of dev environments; in production it's basically limited to back-office-facing UX. An AWS guide to implementing the Business Hours strategy indicates that savings can reach 70% if usage is reduced from 168 hours to 50 hours per week.
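One lightweight way to implement this for an EC2 Auto Scaling group is a pair of scheduled actions (the group name, sizes, and schedule below are hypothetical; recurrence is evaluated in UTC unless you set a time zone):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Scale the dev ASG down to a single node at 8pm on weekdays...
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-workers",
    ScheduledActionName="nightly-scale-in",
    Recurrence="0 20 * * 1-5",
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)

# ...and back up to normal capacity at 7am.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="dev-workers",
    ScheduledActionName="morning-scale-out",
    Recurrence="0 7 * * 1-5",
    MinSize=3,
    MaxSize=10,
    DesiredCapacity=3,
)
```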
Pattern: Decommission / Idle
For example: RDS no connections in 30d, DynamoDB no writes in 30d, ECS near-zero CPU, + more Idle policies
This is very similar to the Dangling pattern: these are resources which were probably left behind and should probably be deleted. However, it has a subtly different risk and effort profile. With Dangling, we are confident that the resource is not used or referenced anywhere (e.g., the VPC definitely has nothing in it). The Decommission pattern, by contrast, relies on indicators of idleness with no guarantee of disuse. For example, a database may be used only for quarterly reporting and therefore show no connections in the last 30 days mid-quarter. So, a more thorough review is required to identify whether these resources can be safely deleted.
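A sketch of how the idleness signal might be gathered for RDS (the 30-day window mirrors the example policy; remember this only indicates idleness, it doesn't prove disuse):

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
rds = boto3.client("rds")
now = datetime.now(timezone.utc)

# Flag RDS instances with zero connections over the last 30 days.
for db in rds.describe_db_instances()["DBInstances"]:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db["DBInstanceIdentifier"]}],
        StartTime=now - timedelta(days=30),
        EndTime=now,
        Period=86400,          # one datapoint per day
        Statistics=["Maximum"],
    )
    peak = max((p["Maximum"] for p in stats["Datapoints"]), default=0)
    if peak == 0:
        print(f"{db['DBInstanceIdentifier']}: no connections in 30 days -- review for decommission")
```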
Pattern: Bandwidth
For example: CloudFront Excessive Price Class, CloudFront not utilizing compression, VPC missing PrivateLink to S3/DynamoDB or another service
AWS bandwidth pricing is notoriously difficult to understand. The billing for bandwidth is a popular running joke: you can price out the rest of your infrastructure in advance, but for bandwidth, you just get to find out once it's all built. In general, although there are a LOT of exceptions:
- Bandwidth inside the same AWS Region and AZ is generally free.
- Bandwidth inside the same AWS Region but between different AZs costs $0.01/GB.
- Bandwidth between AWS Regions is typically $0.01 or $0.02/GB, based on the origin region.
- Bandwidth from the public internet INTO AWS is free.
- Bandwidth from AWS to the public internet is generally $0.09/GB, but decreases with higher consumption.
These rates only establish a floor, because other services may layer charges onto the same bandwidth, which makes determining the true cost of specific packets difficult. For example, if traffic goes from one AWS Region to another, you pay the cross-region charge, but if it arrives at a NAT gateway, those same packets also incur the NAT gateway's $0.045/GB processing charge. This is just to illustrate that bandwidth costs are particularly insidious and obscured.
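To make the stacking concrete, here's the arithmetic for a hypothetical terabyte that crosses regions and then lands on a NAT gateway (rates taken from the list above; your actual rates will vary by region):

```python
# Hypothetical illustration of how charges stack on the same bytes:
# 1 TB sent cross-region and delivered through a NAT gateway.
gb = 1024

cross_region_per_gb = 0.02     # example cross-region transfer rate
nat_processing_per_gb = 0.045  # NAT gateway data-processing rate

total = gb * (cross_region_per_gb + nat_processing_per_gb)
print(f"${gb * cross_region_per_gb:.2f} cross-region + "
      f"${gb * nat_processing_per_gb:.2f} NAT processing = ${total:.2f}")
# -> $20.48 cross-region + $46.08 NAT processing = $66.56 for the same 1 TB
```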
In general, the bandwidth-related opportunities you're looking for can be categorized as:
- Bandwidth reduction efforts (e.g., enabling compression, decreasing sync frequency)
- Relocating bandwidth to more cost-effective regions (e.g., the default CloudFront price class puts bandwidth in all regions, including expensive ones. If your customers can tolerate added latency, money can be saved by moving that bandwidth to a cheaper price class.)
- Reducing bandwidth that goes over the public internet by establishing VPC endpoints/PrivateLink connections with AWS services (S3, DynamoDB) and with third-party vendors you exchange a lot of traffic with (see the sketch after this list). For example, TextNow saved 93% on bandwidth charges by establishing a PrivateLink with Datadog. That traffic was previously going over the public internet and incurring the $0.09/GB charge, but since both TextNow and Datadog are AWS customers, they could leverage same-region or cross-region pricing once the PrivateLink was established.
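As a sketch, the S3 case is a one-call fix: a Gateway VPC endpoint (the VPC, route table, and region below are placeholders) keeps S3 traffic off your NAT gateways and the public internet, and Gateway endpoints for S3/DynamoDB carry no charge of their own.

```python
import boto3

ec2 = boto3.client("ec2")

# Route S3 traffic through a free Gateway endpoint instead of a NAT gateway
# or internet gateway.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```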
Pattern: Retries / Errors
For example: Excessive Lambda error rate, high Step Function retry limit, + more ErrorRate policies
The Well-Architected Framework expects our code to fail often, even during normal operation, and expects us to handle these issues gracefully, often by retrying the request several times.
The cost of this category is best explained by example. Let's assume we have a Lambda that retries a maximum of twice. Normal operation should be somewhere in the ballpark of 99%+ success rate, which means that <1% of requests are executed a second time, and <0.01% of requests are attempted a third time. So, we're paying for an average of 1.01 executions per request.
Now let's assume there's a defect in our code and the success rate drops to 80%; our system still mostly functions. However, 20% of requests are executed a second time and 4% a third time, an aggregate of 1.24 executions per request, even though in the end 80 + (80 × 0.2) + (80 × 0.2 × 0.2) = 99.2% of requests will succeed. Our customers might never notice, but the bill sure will: the cost of this service just increased by ~23%! And that's before factoring in that other processes may be waiting on this one to complete, calls to downstream services, and so on, which compounds the cost increase.
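The same arithmetic, generalized as a small model (it assumes independent attempts and a fixed per-attempt success rate, which is a simplification):

```python
# Expected Lambda executions per request as a function of the per-attempt
# success rate, with up to `max_retries` retries after the first attempt.
def expected_executions(success_rate: float, max_retries: int = 2) -> float:
    failure = 1 - success_rate
    # The (n+1)th attempt happens only if the first n attempts all failed.
    return sum(failure ** attempt for attempt in range(max_retries + 1))

healthy = expected_executions(0.99)   # ~1.01 executions per request
degraded = expected_executions(0.80)  # 1.24 executions per request
print(f"cost increase: {degraded / healthy - 1:.0%}")  # -> ~23%
```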
A second category of cost-related retry issue exists: poor retry architecture may result in a retry storm. It doesn't affect the steady-state price, but during an outage, traffic can easily be tens to hundreds of times higher, which is a significant cost risk. Hypothetical: Service A calls B, B calls C, C calls D, and D calls E, and each one retries calls to its dependency up to three times. If service E suffers a total outage and the error bubbles all the way up to Service A, the nested retries "storm" the downstream services. The net effect is that A sends four times as many requests to Service B, which calls Service C sixteen times as much, and so on, resulting in 256 times as many calls to Service E. If any other services are called before the failing call, they also see a multiplication in traffic, spreading the cost out to a service's peers. It's entirely possible for a mere hours-long outage to result in more charges than an entire month of steady state.
Pattern: Undersized
For example: EC2 high CPU utilization, + more Undersized policies
When an AWS resource is undersized, other resources that depend on it may fail more often, which results in a higher retry rate in those services. This is very similar to the Retries/Errors pattern; it is separated because not every cloud architecture has a conveniently reported error rate. If CPU is at 100%, calls to the resource are likely failing, but we don't know for certain that they are. This category is more of an indicator of an elevated error rate, warranting further investigation.
Where do I begin?
Treatment and prevention will be the subject of a follow-up post. In the meantime, if you want a jump start on assessing the principles outlined here, Pyrae offers a free, no obligation, customized report of your AWS infrastructure by executing 100+ policies aligned with the patterns outlined in this post.