If you are in FinOps, a Cloud Infrastructure developer in payments, or just want to manage AWS infrastructure costs efficiently, check this out.
VGS has been running on AWS since 2016, fully embracing the cloud to support our infrastructure and growth. Our CTO and co-founder, Marshall Jones, is focused on building a technical vision that enables our business and controls costs. One of our key engineering goals this year is optimizing AWS spending relative to usage. We measure this goal by examining our "AWS costs per unit sold."
Pavlo Ulanov, a member of our SRE team, has been instrumental in this effort, saving VGS hundreds of thousands of dollars through his team's FinOps practice. Pavlo shares one of his key initiatives in this blog post: achieving a 20% reduction in AWS RDS database spending by migrating to Aurora I/O-Optimized.
Background
Amazon introduced Aurora I/O-Optimized in 2023 as an extension of Amazon Aurora, a managed relational database service from AWS compatible with MySQL and PostgreSQL databases. Aurora's pay-per-request model was designed to keep input/output (I/O) costs low (sometimes at one-tenth the price of traditional databases), making it ideal for low-to-moderate I/O workloads.
The new Optimized storage configuration was designed for applications with heavy I/O workloads to deliver enhanced performance, lower latency, and predictable pricing. E-commerce and payment processing systems were cited as common use cases, and VGS fit the bill.
Applying Aurora I/O-Optimized at VGS
Earlier this year, VGS reviewed its AWS RDS costs as part of a regular assessment of the efficiency of our AWS usage. We are a cloud-native organization whose primary RDS usage has been on Aurora, so it was natural to start there.
When AWS announced Aurora I/O-Optimized and positioned it as offering an improved price/performance profile depending on an organization's Aurora usage patterns, our FinOps and Cloud Infrastructure/DevOps/SRE teams were intrigued by the opportunity to reduce our cloud costs.
Pavlo supported this effort as a Senior Infrastructure Engineer at VGS, focusing on happy async communication, Cloud Operations, Cloud Cost control, Observability, and Incident Management. The RDS spend described in this article is not typically surfaced by well-known FinOps tools or built-in AWS reports, so we had to track it down ourselves.
Step 1: Reviewing our RDS spend to identify the most expensive cluster
Step 2: Switching to I/O Optimized storage configuration
Step 3: Achieving the intended cost savings
Reviewing our RDS spend to identify the most expensive cluster
If you have looked at Cost Explorer, you have seen that Aurora is billed under many different usage types. We will review the Aurora:StorageIOUsage usage type and discuss how we optimized it.
Aurora:StorageIOUsage accounted for roughly 30% to 40% of our overall daily RDS spend, making it an ideal starting point for our review.
Daily RDS Spend Grouped by Usage Type:
Aurora:StorageIOUsage is 30% - 40% of our total RDS spend
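The breakdown above can be approximated outside the console. The following is a minimal sketch, not the query VGS used: it assumes boto3 with Cost Explorer access, and the function names (rds_usage_type_request, usage_type_shares, share_for, fetch_shares) are hypothetical. It matches usage types by suffix because Cost Explorer can prefix them with a region code.

```python
from collections import defaultdict


def rds_usage_type_request(start: str, end: str) -> dict:
    """Build get_cost_and_usage parameters: daily RDS cost grouped by usage type."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # ISO dates; End is exclusive
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Relational Database Service"],
            }
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


def usage_type_shares(response: dict) -> dict:
    """Sum cost per usage type over the period and return each type's share."""
    totals = defaultdict(float)
    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[group["Keys"][0]] += amount
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {usage_type: cost / grand_total for usage_type, cost in totals.items()}


def share_for(shares: dict, suffix: str) -> float:
    """Add up shares of all usage types ending with suffix (handles region prefixes)."""
    return sum(share for usage_type, share in shares.items() if usage_type.endswith(suffix))


def fetch_shares(start: str, end: str) -> dict:
    """Query Cost Explorer. Needs boto3 and credentials; ignores pagination for brevity."""
    import boto3  # imported here so the helpers above work without boto3 installed

    ce = boto3.client("ce")
    return usage_type_shares(ce.get_cost_and_usage(**rds_usage_type_request(start, end)))
```

With credentials in place, share_for(fetch_shares("2024-01-01", "2024-02-01"), "Aurora:StorageIOUsage") would return the I/O share of RDS spend for that month; the dates are placeholders.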
In trying to understand which database or usage contributes to your costs, you may assume that the most used database with the most active connections would cost the most, but that's not always the case.
Here is the official explanation from AWS from the FAQ: “What are I/O operations in Aurora and how are they calculated?”
“I/O operations are performed by the Aurora database engine against its SSD-based virtualized storage layer. Every database page read operation counts as one I/O.
The Aurora database engine issues reads against the storage layer to fetch database pages not present in memory in the cache: If your query traffic can be totally served from memory or the cache, you will not be charged for retrieving any data pages from memory.”
Method:
Luckily, AWS CloudWatch provides metrics for evaluating storage I/O operations and identifying the most expensive RDS clusters: VolumeReadIOPs for read operations and VolumeWriteIOPs for write operations.
CloudWatch Metrics VolumeWriteIOPs and VolumeReadIOPs:
For the entire month, grouped by 1-hour intervals
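For readers who prefer a script to the console graphs, here is a rough sketch of the same comparison. It assumes boto3 with CloudWatch read access; the function names and the Id naming scheme are hypothetical, and the cluster identifiers are whatever your account uses. It requests hourly sums of both metrics per cluster and ranks clusters by total I/O.

```python
from datetime import datetime, timedelta, timezone

METRICS = ("VolumeReadIOPs", "VolumeWriteIOPs")


def volume_io_queries(cluster_ids):
    """Build MetricDataQueries: hourly sums of read and write I/O per cluster."""
    queries = []
    for index, cluster_id in enumerate(cluster_ids):
        for metric in METRICS:
            queries.append({
                # Query Ids must start with a lowercase letter; c<index> maps back to the cluster.
                "Id": f"c{index}_{metric.lower()}",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/RDS",
                        "MetricName": metric,
                        "Dimensions": [
                            {"Name": "DBClusterIdentifier", "Value": cluster_id}
                        ],
                    },
                    "Period": 3600,  # 1-hour intervals, as in the graphs
                    "Stat": "Sum",
                },
                "ReturnData": True,
            })
    return queries


def rank_clusters(metric_data_results, cluster_ids):
    """Total read + write I/O per cluster, highest first."""
    totals = {cluster_id: 0.0 for cluster_id in cluster_ids}
    for result in metric_data_results:
        index = int(result["Id"].split("_")[0][1:])
        totals[cluster_ids[index]] += sum(result["Values"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def fetch_ranking(cluster_ids, days=30):
    """Query CloudWatch for the last `days` days. Needs boto3 and credentials."""
    import boto3  # imported here so the helpers above work without boto3 installed

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    pages = cloudwatch.get_paginator("get_metric_data").paginate(
        MetricDataQueries=volume_io_queries(cluster_ids),
        StartTime=end - timedelta(days=days),
        EndTime=end,
    )
    results = [result for page in pages for result in page["MetricDataResults"]]
    return rank_clusters(results, cluster_ids)
```

Summing reads and writes together is a simplification for ranking purposes; keep them separate if you want to see which side drives the cost.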
Observation:
Only one or two clusters have millions of Volume I/O operations, while the rest have very few.
The usage is also uneven over time: a daily spike lasts for a few hours before it settles down. Why? Let's explore the cluster with the most I/O operations.
Digging Deeper:
- Using Performance Insights, we can see that IO:DataFileRead is the most frequent wait event, and the timing matches the CloudWatch metrics posted above. The application behind this usage is a cron job that performs a database cleanup on a daily and weekly schedule.
- Looking at the CloudWatch metrics again, the Volume I/O operations spike on Mondays. I had to take more recent data because of the Performance Insights retention policy (7 days in our case), but the overall pattern is the same as I've shown above.
Performance Insights compared with the CloudWatch metrics VolumeWriteIOPs and VolumeReadIOPs:
Identifying the most expensive cluster, grouped by 1-hour intervals
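The weekly pattern noted above can also be checked numerically rather than by eye. This is a small illustrative sketch with hypothetical function names: given hourly datapoints (for example the Timestamps and Values lists CloudWatch returns for one metric), it totals I/O per weekday so a Monday spike stands out.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def io_by_weekday(timestamps, values):
    """Total I/O operations per weekday from parallel lists of datetimes and counts."""
    totals = defaultdict(float)
    for timestamp, value in zip(timestamps, values):
        totals[WEEKDAYS[timestamp.weekday()]] += value
    return dict(totals)


def busiest_weekday(timestamps, values):
    """Name of the weekday with the highest total I/O."""
    totals = io_by_weekday(timestamps, values)
    return max(totals, key=totals.get)
```

Note that CloudWatch timestamps are in UTC, so a job that runs late on Sunday night local time may show up under Monday.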
Result:
Those graphs are perfectly aligned! We have found an RDS cluster we want to migrate first to Aurora I/O-Optimized.
Switching to I/O Optimized storage configuration
We use infrastructure as code, so switching to the new storage type wasn't hard. We made the change during the RDS maintenance window.
In the console, modifying the existing RDS cluster's storage configuration took a single click and required no downtime. We didn't observe any issues during or after the switch.
Screenshot from the VGS AWS Console:
Modifying the Aurora cluster storage configuration
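The same change can be scripted. The sketch below is a boto3 equivalent of the console action, not the infrastructure-as-code change VGS actually applied; the function names and the cluster identifier are hypothetical. It sets the cluster's StorageType to aurora-iopt1 and, with ApplyImmediately left false, defers the change to the next maintenance window.

```python
def io_optimized_change(cluster_id: str, apply_immediately: bool = False) -> dict:
    """Build modify_db_cluster parameters that switch a cluster to I/O-Optimized."""
    return {
        "DBClusterIdentifier": cluster_id,
        "StorageType": "aurora-iopt1",  # "aurora" is the Standard configuration
        "ApplyImmediately": apply_immediately,  # False waits for the maintenance window
    }


def switch_to_io_optimized(cluster_id: str, apply_immediately: bool = False) -> dict:
    """Apply the change. Needs boto3, credentials, and rds:ModifyDBCluster permission."""
    import boto3  # imported here so the helper above works without boto3 installed

    rds = boto3.client("rds")
    response = rds.modify_db_cluster(**io_optimized_change(cluster_id, apply_immediately))
    return response["DBCluster"]
```

Calling switch_to_io_optimized("my-aurora-cluster") would schedule the switch for a cluster with that (placeholder) identifier.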
Result:
The results were visible in Cost Explorer almost immediately. In the hourly view, daily/weekly cost spikes disappeared.
Let's review the daily spending below after we implemented the change, using a subset of RDS usage types.
Cost Explorer Daily View:
Grouped by Usage Types
Result:
With the switch, Aurora:StorageIOUsage dropped dramatically, while a couple of new usage types showed up: Aurora:IO-OptimizedStorageUsage and new instance usage types for I/O-Optimized instances. The weekly usage spikes have gone away, too.
Achieving the intended cost savings
For VGS, the change resulted in a significant cost improvement of hundreds of thousands of dollars with zero downsides.
Overall, our RDS spend dropped by ~20%!
Additional Considerations
Before choosing I/O-Optimized storage mode for Aurora, consider that:
- Your cluster must run a supported engine version and a supported instance type.
- You can only switch to I/O-Optimized once every 30 days. According to the AWS documentation, switching back to Standard is possible at any time.
- You need to carefully calculate whether the I/O-Optimized storage type is a suitable option for you. AWS recommends it “if your I/O spend exceeds 25 percent of your current Aurora database spend.”
- You should also check the latest official AWS documentation.
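The 25 percent guideline above lends itself to a quick calculation. The sketch below is illustrative only, with hypothetical function names. The threshold comes from the AWS recommendation quoted above; the price multipliers for I/O-Optimized instances and storage are deliberately left as required inputs, since they vary by region and over time and should be taken from the current AWS pricing page.

```python
def io_share(io_cost: float, total_aurora_cost: float) -> float:
    """Fraction of Aurora spend that goes to I/O charges."""
    if total_aurora_cost <= 0:
        raise ValueError("total_aurora_cost must be positive")
    return io_cost / total_aurora_cost


def worth_evaluating(io_cost: float, total_aurora_cost: float, threshold: float = 0.25) -> bool:
    """True when I/O spend exceeds the threshold share of total Aurora spend."""
    return io_share(io_cost, total_aurora_cost) > threshold


def estimated_io_optimized_cost(
    instance_cost: float,
    storage_cost: float,
    other_cost: float,
    instance_multiplier: float,
    storage_multiplier: float,
) -> float:
    """Rough projected spend: I/O charges drop out, instance and storage prices scale up.

    The multipliers are assumptions supplied by the caller, not values from this article.
    """
    return instance_cost * instance_multiplier + storage_cost * storage_multiplier + other_cost
```

For example, with monthly costs of 50 for instances, 10 for storage, 35 for I/O, and 5 for everything else, the I/O share is 35%, above the threshold. Under assumed multipliers of 1.3 for instances and 2.25 for storage (placeholders to verify against current pricing), the projected spend would be 92.5 compared with the current 100.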