$ cat post/the-build-finally-passed-/-the-logs-held-no-answers-then-/-it-failed-gracefully.md

the build finally passed / the logs held no answers then / it failed gracefully


Debugging the Cloud Cost Matrix: A Journey Through FinOps


January 29, 2024. Another day in the life of a platform engineer, or so it seemed as I sat staring at my laptop, trying to untangle the web of cloud costs that had grown far too complex for my liking. As the AI/LLM infrastructure explosion raged on and FinOps became more than just a buzzword, the pressure was on to optimize costs without sacrificing performance.

The Context

On the tech front, we were in an interesting phase: companies like Anthropic and Stability AI had launched their own models, each bringing its own set of infrastructure challenges. Platform engineering was going mainstream, and the FinOps landscape was growing just as fast and just as overwhelming. DORA metrics were being widely adopted, highlighting the need for continuous improvement in delivery speed and quality. The CNCF landscape kept expanding too, offering a dizzying array of tools that I was still trying to keep up with.

The Problem

In our engineering team, we ran a complex set of microservices across AWS, GCP, and Azure. Each service was built on a mix of Kubernetes, Docker, and various managed services. CloudWatch and Cost Explorer were our tools of choice for cost monitoring, but neither gave us the per-service granularity we needed.

One morning, as I was reviewing the cost reports, I noticed something odd: our bill had doubled over the past month. It wasn't any single service driving it; the increase was spread across all of them. Maybe someone had accidentally enabled a high-cost feature, or maybe there were hidden charges we weren't aware of.

Digging In

I decided to dive deep into our cost management practices. I started by setting up detailed logging for cloud provider APIs and Kubernetes events. This gave me a trail of breadcrumbs to follow. After hours of tracing, I found the culprit: an unoptimized AWS Lambda function that was running more than it needed to.

The Debugging Process

I opened my terminal and ran `kubectl get pods` across our clusters to see what was actually running where. Then I invoked the suspect function with `aws lambda invoke --function-name <func_name> --log-type Tail` to capture its logs, and cross-checked its invocation metrics in CloudWatch. They showed one particular function firing every 5 seconds, even though it was only supposed to run once a minute.
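That mismatch is easy to quantify. A minimal sketch of the back-of-the-envelope math (the helper name is mine, the intervals are the ones from the logs):

```python
def overrun_factor(observed_interval_s: float, expected_interval_s: float) -> float:
    """How many times more often a function runs than it was scheduled to."""
    return expected_interval_s / observed_interval_s

# Supposed to run once a minute, actually firing every 5 seconds:
print(overrun_factor(observed_interval_s=5, expected_interval_s=60))  # → 12.0
```

Twelve times the invocations means roughly twelve times the Lambda spend for that function, which lined up with the jump we saw in Cost Explorer.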

I dug into the code and found the issue: an unnecessary loop that was causing the function to trigger multiple times. After refactoring the logic, I redeployed the changes and watched as our Lambda costs started to drop. It wasn’t just a matter of saving money; it was about ensuring we were using resources efficiently.
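I can't share the actual handler, but the shape of the bug is common enough to sketch: at the end of each run, the function re-enqueued a follow-up trigger for every item it processed, so it never stopped firing. A toy simulation (all names hypothetical) shows why the fix matters:

```python
from collections import deque

def run_simulation(handler, initial_events, max_steps=50):
    """Count how many times `handler` runs, feeding any events it
    returns back into the queue (mimicking self-triggering)."""
    queue = deque(initial_events)
    runs = 0
    while queue and runs < max_steps:
        event = queue.popleft()
        runs += 1
        queue.extend(handler(event))
    return runs

# Buggy: after processing, re-enqueue a trigger per item --
# the "unnecessary loop" that kept the function firing.
def buggy_handler(event):
    return [{"retrigger": True} for _ in event.get("items", [1])]

# Fixed: process the event once and stop; no self-triggering.
def fixed_handler(event):
    return []

seed = [{"items": [1]}]
print(run_simulation(buggy_handler, seed))  # never terminates; hits the cap: 50
print(run_simulation(fixed_handler, seed))  # runs exactly once: 1
```

The refactor amounted to removing the self-triggering path so each scheduled event produced exactly one execution.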

Lessons Learned

This experience reinforced my belief in the importance of thorough testing and continuous monitoring. We needed better tooling to automate these checks, so I proposed integrating a custom dashboard that would alert us whenever there were unexpected spikes in costs or resource utilization.
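Before wiring anything to real billing data, we prototyped the alerting heuristic itself. A minimal sketch, assuming one spend figure per day (the function name and the 1.5x threshold are illustrative, not our production values):

```python
from statistics import mean

def cost_spike(daily_costs, threshold=1.5):
    """Flag a spike when the latest day's spend exceeds `threshold`
    times the trailing average of all preceding days."""
    *history, latest = daily_costs
    baseline = mean(history)
    return latest > threshold * baseline

# A week of steady spend, then a jump on the last day:
print(cost_spike([100, 98, 103, 101, 99, 102, 240]))  # → True
print(cost_spike([100, 98, 103, 101, 99, 102, 104]))  # → False
```

A trailing-average check like this is crude (it ignores weekly seasonality, for one), but it's enough to catch the kind of doubling that bit us.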

I also realized that our DevOps practices could use some refinement. We had too many manual steps, and this was causing bottlenecks. I suggested we adopt GitOps principles to streamline our deployment pipeline and ensure consistency across environments.

The Future

As FinOps continues to evolve, so must the tools and practices we use to manage cloud costs. WebAssembly on the server side offers exciting possibilities for optimizing performance and reducing costs, but it’s still in its early stages. I’m excited about exploring how we can leverage these technologies to build more efficient and cost-effective systems.

For now, though, the focus is on improving what we already have. We need to get our DevOps practices right before diving into new technologies. The road ahead will be bumpy, but with the right tools and a willingness to adapt, I’m confident we can navigate through it.


This post is just a snapshot of my journey in 2024. There’s still so much more to learn and do. But for now, as I sit here reflecting on what worked and what didn’t, I feel better equipped to handle the challenges ahead.