Debugging Cloud Costs with DORA Metrics
August 29, 2022. A Monday morning that started like any other—snoozing through the alarm, checking email while sipping coffee. Today’s problem: our cloud bill was out of control.
We’ve been in a bit of a tailspin since moving more of our services to managed Kubernetes clusters on GKE (Google Kubernetes Engine). Costs were climbing, and we weren’t sure where to start looking. That’s when DORA metrics came into the picture.
The Cloud Bill
The last month’s bill was $50K. That’s a lot for us—enough to keep the whole team in coffee for a year. But it wasn’t just the size of the bill; it was the sheer number of services running, with varying levels of optimization and visibility.
Enter DORA
DORA (DevOps Research and Assessment) metrics have been a hot topic in software engineering circles, especially since Google acquired the DORA research program in 2018. They measure four key areas: Lead Time for Changes, Deployment Frequency, Mean Time to Restore, and Change Failure Rate. While primarily focused on DevOps practices, we saw an opportunity to apply these metrics to our cloud costs.
Breaking Down Costs
We started with a cost breakdown:
- Compute (VMs and Pods)
- Storage
- Network egress
- Logging and monitoring tools
Each category led us down a different rabbit hole. For example, the compute section showed high costs from GKE clusters that were misconfigured or running unnecessary services.
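Before chasing any single rabbit hole, it helps to get the breakdown itself out of the raw billing data. Here’s a minimal sketch of the kind of aggregation we ran, assuming a CSV billing export with hypothetical `service` and `cost_usd` columns (real GCP exports carry many more fields):

```python
import csv
import io
from collections import defaultdict

# Hypothetical billing-export rows; real exports have far more columns.
SAMPLE = """service,cost_usd
Compute Engine,31000.00
Kubernetes Engine,9000.00
Cloud Storage,4000.00
Networking,3500.00
Cloud Logging,2500.00
"""

def cost_by_service(csv_text):
    """Sum cost per service from a billing-export CSV."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["service"]] += float(row["cost_usd"])
    return dict(totals)

if __name__ == "__main__":
    # Print categories from most to least expensive.
    for service, cost in sorted(cost_by_service(SAMPLE).items(),
                                key=lambda kv: -kv[1]):
        print(f"{service:20s} ${cost:>10,.2f}")
```

Sorting the totals descending makes the biggest rabbit holes obvious at a glance—in our case, compute dwarfed everything else.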
Implementing DORA Metrics
We decided to focus on Lead Time for Changes first. We wanted to understand how long it took to deploy new changes and if there were any bottlenecks in our pipeline. Here’s what we did:
- Lead Time for Changes:
  - Set up a GitOps tool to track all changes being made to the infrastructure.
  - Added timestamps to each commit that led to an update in our Kubernetes cluster.
- Deployment Frequency:
  - Started using continuous integration and deployment pipelines more rigorously.
  - Automated testing for every pull request, ensuring we could deploy frequently without breaking anything.
- Mean Time to Restore (MTTR):
  - Established monitoring alerts for critical services to quickly identify and resolve issues.
  - Set up a fallback plan with backup instances ready to go in case of a failure.
- Change Failure Rate:
  - Improved rollback procedures to ensure that any changes causing problems could be reverted quickly.
  - Introduced mandatory reviews before deploying sensitive configurations or features.
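The Lead Time for Changes step above boils down to simple arithmetic once you have the timestamps: deploy time minus commit time, per change. Here’s a sketch of that calculation; the `changes` pairs are made-up values standing in for data pulled from Git history and the pipeline’s deployment log:

```python
from datetime import datetime, timezone
from statistics import median

def lead_times_hours(changes):
    """Lead time per change: deploy timestamp minus commit timestamp, in hours."""
    return [(deploy_ts - commit_ts).total_seconds() / 3600
            for commit_ts, deploy_ts in changes]

# Hypothetical (commit, deploy) timestamp pairs for illustration.
changes = [
    (datetime(2022, 8, 1, 9, 0, tzinfo=timezone.utc),
     datetime(2022, 8, 1, 15, 0, tzinfo=timezone.utc)),   # 6 hours
    (datetime(2022, 8, 2, 10, 0, tzinfo=timezone.utc),
     datetime(2022, 8, 3, 10, 0, tzinfo=timezone.utc)),   # 24 hours
]

print(f"median lead time: {median(lead_times_hours(changes)):.1f}h")
```

We tracked the median rather than the mean, since a single stuck change can otherwise dominate the number.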
Results
After a few weeks, the results were encouraging:
- Lead Time for Changes dropped by 30% as we streamlined our deployment process.
- Deployment Frequency increased by 25%, allowing us to iterate faster and more confidently.
- MTTR improved significantly; we could recover from most issues in under an hour.
- Change Failure Rate decreased due to better testing and review processes.
Cost Savings
But the real proof was in the cost savings. Over a few months, we managed to reduce our cloud bill by 20%. We achieved this by:
- Right-sizing instances based on actual usage patterns rather than over-provisioning.
- Migrating services from high-cost regions to more economical ones.
- Eliminating unused resources and optimizing storage costs.
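The right-sizing item is the most mechanical of the three: compare what each workload requests with what it actually uses, and flag the gaps. Here’s a minimal sketch of that check; the workload names, millicore figures, and the 50% threshold are all illustrative assumptions, not our real numbers:

```python
def overprovisioned(workloads, threshold=0.5):
    """Flag workloads whose observed peak CPU is below `threshold`
    of the requested CPU—candidates for right-sizing."""
    return [(name, requested_mcpu, peak_mcpu)
            for name, requested_mcpu, peak_mcpu in workloads
            if peak_mcpu < requested_mcpu * threshold]

# Hypothetical data: (workload, requested millicores, observed peak millicores),
# the kind of numbers you might pull from resource requests and metrics.
workloads = [
    ("api-gateway", 2000, 400),
    ("batch-worker", 1000, 900),
    ("metrics-agent", 500, 100),
]

for name, req, peak in overprovisioned(workloads):
    print(f"{name}: requested {req}m, peak {peak}m -> right-size candidate")
```

In practice we compared peak usage over a few weeks, not a single snapshot, before shrinking any requests.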
The Road Ahead
Debugging the cloud bill was a challenge, but it forced us to look at our infrastructure in a different way. DORA metrics helped us identify areas for improvement and made the process of cost optimization more structured. It’s not about saving money just to save money; it’s about making smarter decisions that benefit both the business and the environment.
This experience reinforced my belief that every engineer should have a basic understanding of their cloud provider’s billing structure. Too often, we’re so focused on development that we overlook the financial implications of our work.
So, if you find yourself staring at a hefty cloud bill or wondering where to start optimizing your infrastructure, give DORA metrics a try. It might just help you save some coffee money along the way.
Until next time, Brandon