$ cat post/the-function-returned-/-what-the-stack-trace-never-showed-/-i-saved-the-core-dump.md

19AUG24

the function returned / what the stack trace never showed / I saved the core dump

Debugging the Cloud FinOps Quagmire

August 19, 2024

It’s been a while since I last wrote anything here. A lot has changed since then, both technologically and personally. The tech world is buzzing with AI and LLMs, platform engineering is becoming mainstream, and the CNCF landscape feels like an overwhelming smorgasbord of choices. WebAssembly is making its way into server-side workloads, FinOps is putting pressure on cloud budgets, and DORA metrics are everywhere. But let’s cut to the chase: the day-to-day grind can still feel like a never-ending cycle.

The Context

Earlier this month, I found myself buried in a particularly complex issue that felt almost too familiar. One of our team members reached out because they noticed an unusual spike in AWS cloud costs. It was one of those moments where you think, “Oh great, another billing anomaly,” but this time it was different.

The Issue

Two problems turned out to be tangled together in our EKS cluster. A well-intentioned developer had set up automatic scaling on an instance type that wasn’t a good fit for our workload, which caused unexpected resource usage spikes during off-peak hours. On top of that, the cost allocation tags were misconfigured, so those spikes were charged to multiple departments instead of staying within a single project.

The Debugging Journey

Initially, I thought it would be a quick fix. After all, it’s just a configuration issue, right? But as I dug deeper, the complexity revealed itself. There was an interaction between cost allocation tags and automatic scaling policies that I hadn’t fully grasped before. Plus, our internal billing tool wasn’t helping much; it only provided aggregated data without any granular insights into specific instances.

I spent hours poring over logs and digging through AWS CloudWatch metrics, trying to trace back the exact moment when the spike occurred. It was like looking for a needle in a haystack, but with multiple layers of complexity.
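As an illustration of that search (not the exact process used here), the onset of a spike can be located by comparing each sample against the median of the samples just before it. This is a minimal sketch in plain Python: the function name `find_spike_onset`, the window size, the threshold factor, and the sample data are all made up for the example, and the metric values would in practice come from an export of whatever metric is being traced.

```python
from datetime import datetime, timedelta
from statistics import median


def find_spike_onset(samples, window=6, factor=2.0):
    """Return the timestamp of the first sample whose value exceeds
    `factor` times the median of the preceding `window` samples.

    `samples` is a list of (timestamp, value) pairs in time order.
    Returns None if no sample crosses the threshold.
    """
    for i in range(window, len(samples)):
        baseline = median(value for _, value in samples[i - window:i])
        timestamp, value = samples[i]
        if baseline > 0 and value > factor * baseline:
            return timestamp
    return None


# Hypothetical hourly instance counts: steady at 4, then a jump to 12.
start = datetime(2024, 7, 31, 18, 0)
counts = [4, 4, 4, 4, 4, 4, 4, 4, 12, 12, 12, 4]
samples = [(start + timedelta(hours=h), c) for h, c in enumerate(counts)]

onset = find_spike_onset(samples)
print("spike began at:", onset)
```

Using a median rather than a mean for the baseline keeps a single earlier outlier from hiding the real onset; the window and factor would need tuning against real data.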

The Solution

After several false starts (and a few moments of self-doubt where I wondered if I was even capable of solving this), I found the culprit: an autoscaling group that had been misconfigured to scale up using an instance type that wasn’t suitable for our needs. Once I pinned down the exact group, it was straightforward to adjust the scaling policies and apply proper cost allocation tags.
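A check like this can be automated once the scaling group configuration is available as data. The sketch below is a hypothetical audit over plain dictionaries, not a call into any AWS API: the record shape, the required tag keys, the approved instance types, and the group names are all assumptions chosen for illustration.

```python
# Assumed policy for the example; real values depend on the organization.
REQUIRED_TAGS = {"project", "cost-center"}
ALLOWED_INSTANCE_TYPES = {"m5.large", "m5.xlarge"}


def audit_group(group):
    """Return a list of human-readable problems for one scaling group record."""
    problems = []
    if group["instance_type"] not in ALLOWED_INSTANCE_TYPES:
        problems.append(f"instance type {group['instance_type']} is not approved")
    # Report each required cost allocation tag that is absent.
    for key in sorted(REQUIRED_TAGS - set(group["tags"])):
        problems.append(f"missing cost allocation tag '{key}'")
    return problems


# Hypothetical scaling group records.
groups = [
    {"name": "web-workers", "instance_type": "m5.large",
     "tags": {"project": "storefront", "cost-center": "1001"}},
    {"name": "batch-workers", "instance_type": "p3.8xlarge",
     "tags": {"project": "storefront"}},
]

# Keep only the groups that have at least one problem.
report = {}
for g in groups:
    problems = audit_group(g)
    if problems:
        report[g["name"]] = problems

for name, problems in report.items():
    print(f"{name}: " + "; ".join(problems))
```

Run on a schedule or in review, a check of this kind would surface both halves of the problem (an unsuitable instance type and incomplete tags) before they show up on a bill.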

The real challenge, however, came in explaining why this happened and preventing future occurrences. We needed better visibility into our cloud spending and more robust tools for managing costs. It’s one thing to know you’ve got a problem; it’s another to ensure that everyone is aware of the risks and knows how to avoid them.

The Aftermath

Now, we’re in the process of enhancing our internal billing tool with real-time alerts for cost anomalies. We’re also implementing more rigorous testing and review processes for any new changes in cloud infrastructure. This incident has highlighted a need for more transparency and better communication across teams regarding resource usage and costs.
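To make the idea of a cost anomaly alert concrete, here is one simple rule such a tool could apply: flag a team when today’s spend exceeds the trailing mean by more than a few standard deviations. This is a sketch under stated assumptions, not a description of our billing tool; the function name `is_anomalous`, the thresholds, the team names, and the dollar figures are all invented for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, today, sigmas=3.0, min_days=7):
    """Return True if `today` exceeds mean(history) + sigmas * stdev(history).

    With fewer than `min_days` of history there is not enough data to judge,
    so nothing is flagged.
    """
    if len(history) < min_days:
        return False
    threshold = mean(history) + sigmas * stdev(history)
    return today > threshold


# Hypothetical daily spend in dollars: (previous seven days, today).
daily_spend = {
    "platform": ([120.0, 118.0, 125.0, 122.0, 119.0, 121.0, 123.0], 124.0),
    "data": ([80.0, 82.0, 79.0, 81.0, 83.0, 80.0, 82.0], 140.0),
}

alerts = sorted(
    team
    for team, (history, today) in daily_spend.items()
    if is_anomalous(history, today)
)
print("teams to alert:", alerts)
```

A fixed-sigma rule is deliberately crude: it ignores weekly seasonality, and a perfectly flat history makes any increase look anomalous. It is still a reasonable starting point because it breaks spend down per team instead of relying on one aggregated number.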

Reflections on FinOps

Reflecting on this, it’s clear that the pressure on cloud budgets is only getting tighter. As developers, we often focus so much on functionality and performance that financial considerations can easily fall by the wayside. But with cloud services becoming such a significant portion of operational expenses, it’s crucial to stay vigilant.

I’ve been thinking about how to better integrate cost management into our DevOps practices. Maybe we need more regular “cost walk-throughs” where we review spending patterns and identify areas for optimization. And perhaps the tools themselves could become smarter, proactively flagging wasteful spending before it becomes an issue.

The Takeaway

This experience was a stark reminder of the importance of staying on top of not just technical challenges but also financial ones. It’s easy to get caught up in building cool features, but forgetting about cloud costs can lead to real headaches down the line.

In the end, it wasn’t just about fixing the bill; it was about ensuring that we’re doing everything we can to operate efficiently and transparently. After all, as a platform engineer, I’m not just writing code—I’m also managing budgets and ensuring our infrastructure is used effectively.

Until next time, Brandon