Platform Engineering Blues: When DevOps Meets FinOps

July 18, 2022, felt like any other day at the office. Or so I thought. The morning started with an urgent call from our FinOps team about unexplained spikes in cloud costs. It was a good reminder that even as platform engineers, we can’t afford to ignore financial realities.

The Incident

It all started innocently enough, with some new LLM (large language model) experiments on our internal API. Large language models had everyone buzzing about AI, and we were no exception. Our team had been running a series of tests to integrate LLMs into our platform, using a managed service from one of the major cloud providers to handle the heavy lifting.

But as we started processing more requests through this new service, something didn’t feel right. Our FinOps team flagged some suspicious charges. They showed us a graph that looked like a rollercoaster—up and down in a way that made no business sense. We knew it had to be related to our recent API tests.

The Investigation

The first thing I did was log into the cloud console, hoping for a simple explanation. But as the numbers rolled in, it became clear we were dealing with something more complex than a misconfigured quota or a forgotten resource. It looked as if some of our API calls were being duplicated and multiplied to absurd volumes.

I started digging through logs, examining request patterns, and scrutinizing every single API call. The managed service provider’s documentation didn’t mention any security vulnerabilities that could lead to this kind of abuse. Then again, that documentation wasn’t written with hobbyists or small teams trying out the APIs for the first time in mind.

Platform Engineering vs FinOps

I quickly realized we needed a multidisciplinary approach here. While I was familiar with platform engineering best practices, I had never really dealt much with FinOps before. We needed to sit down and discuss how to build resilience into our platforms not just from an operational standpoint but also from a financial one.

A Brief Chat

I called up the head of FinOps and we hashed out what we could do to prevent this kind of thing in the future. He explained that they monitor costs using budget alerts, which fire when spending exceeds certain thresholds. These alerts help identify issues once the money is already spent, but they aren’t designed to catch every potential misuse scenario.
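
To make the threshold idea concrete, here is a minimal sketch in Python. It is not the FinOps team's actual tooling or any cloud provider's API. The names check_budget and BudgetAlert, the budget figure, and the 50/80/100 percent thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetAlert:
    threshold_pct: int  # percentage of the budget that was reached
    spend: float        # month-to-date spend when the check ran
    budget: float       # monthly budget (hypothetical figure)


def check_budget(spend: float, budget: float, thresholds=(50, 80, 100)) -> list[BudgetAlert]:
    """Return one alert for every threshold the current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used_pct = spend / budget * 100
    return [BudgetAlert(t, spend, budget) for t in sorted(thresholds) if used_pct >= t]


# Example: 4,200 spent against a 5,000 budget is 84% used.
alerts = check_budget(spend=4_200.0, budget=5_000.0)
print([a.threshold_pct for a in alerts])  # [50, 80]
```

The sketch also shows the limitation: a check like this only reacts to money that has already been spent, and it says nothing about which calls caused the spend.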

We talked about implementing rate limiting on our APIs, but decided against it because we needed to allow a high degree of flexibility for developers to experiment. We brainstormed ways to add more visibility into API usage and set up better monitoring at the service level. The idea was to get real-time insights into what was being consumed so that anomalies could be flagged before they became costly.
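
One way to flag anomalies without hard rate limits is to compare each new usage sample against a rolling baseline. The sketch below assumes requests are counted per minute and flags any sample more than three standard deviations above the recent average. The class name, window size, and sigma value are hypothetical, and this illustrates the general technique rather than the monitoring setup described in this post.

```python
from collections import deque
from statistics import mean, pstdev


class UsageAnomalyDetector:
    """Flags request counts that sit far above a rolling baseline."""

    def __init__(self, window: int = 30, sigma: float = 3.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # recent "normal" samples
        self.sigma = sigma
        self.min_samples = min_samples

    def observe(self, requests_per_minute: int) -> bool:
        """Return True if this sample is far above the recent baseline."""
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            # Floor the spread so a perfectly flat baseline doesn't flag tiny wobbles.
            limit = baseline + self.sigma * max(spread, 1.0)
            is_anomaly = requests_per_minute > limit
        if not is_anomaly:
            # Keep anomalous samples out of the baseline so a sustained spike keeps alerting.
            self.history.append(requests_per_minute)
        return is_anomaly


detector = UsageAnomalyDetector()
samples = [100, 104, 98, 101, 97, 103, 950, 99]
flags = [detector.observe(s) for s in samples]
print(flags)  # only the 950 sample is flagged
```

Because nothing is blocked, developers keep their freedom to experiment. The detector only raises a flag that someone (or an alerting pipeline) can act on.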

Lessons Learned

This incident taught me a lot about the intersection of platform engineering and FinOps. While I’m comfortable with setting up robust infrastructure, understanding how to manage costs efficiently is something I need to brush up on. It’s no longer enough just to ensure things are running smoothly; we also need to make sure they’re not sucking up too much money.

We ended up implementing a combination of budget alerts and detailed logging for our API services. We also started using a tool that allowed us to visualize API usage patterns, which made it easier to spot anomalies early on. This collaboration between teams was key—without FinOps’ insights, we wouldn’t have seen the issue at all.
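
As a rough illustration of what "detailed logging" can look like, the sketch below emits one structured record per API call and rolls the records up by caller. Everything specific here is an assumption: the field names, the per-token pricing model, the price itself, and the endpoint path are hypothetical and do not describe the actual service or its billing.

```python
import json
import logging
import time

logger = logging.getLogger("api_usage")


def usage_record(caller: str, endpoint: str, tokens: int, price_per_1k_tokens: float) -> dict:
    """Build one structured usage record with a rough cost estimate."""
    return {
        "ts": time.time(),
        "caller": caller,
        "endpoint": endpoint,
        "tokens": tokens,
        # Assumed pricing model: a flat price per 1,000 tokens.
        "estimated_cost": round(tokens / 1000 * price_per_1k_tokens, 6),
    }


def log_usage(record: dict) -> None:
    # One JSON object per line keeps the log easy to aggregate later.
    logger.info(json.dumps(record, sort_keys=True))


def cost_by_caller(records: list[dict]) -> dict[str, float]:
    """Sum the estimated cost per caller so outliers stand out."""
    totals: dict[str, float] = {}
    for r in records:
        totals[r["caller"]] = totals.get(r["caller"], 0.0) + r["estimated_cost"]
    return totals


records = [
    usage_record("team-a", "/v1/generate", 1500, 0.02),
    usage_record("team-b", "/v1/generate", 500, 0.02),
    usage_record("team-a", "/v1/generate", 2000, 0.02),
]
for r in records:
    log_usage(r)
print(cost_by_caller(records))
```

Attributing cost to a caller at log time is what turns a rollercoaster graph into a question someone can answer: which caller, which endpoint, and when.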

Moving Forward

As platform engineering continues to grow in importance, so too does its relationship with FinOps. We’re not just responsible for building reliable infrastructure anymore; we need to ensure it’s built in a way that doesn’t burn through resources faster than we can keep up. It’s about creating sustainable platforms that balance functionality and cost-effectiveness.

In the end, this wasn’t just another bug or issue to fix—it was an opportunity to rethink how we approach platform development from the ground up. And that’s what makes working in tech so rewarding—always learning something new and adapting to the ever-changing landscape around us.

Platform engineering isn’t just about making things work; it’s also about ensuring they do so efficiently. Sometimes, you find yourself stepping out of your comfort zone, dealing with issues you didn’t anticipate. But that’s part of the journey.