$ cat post/a-segfault-at-three-/-we-patched-it-and-moved-along-/-a-segfault-in-time.md

a segfault at three / we patched it and moved along / a segfault in time


Debugging the FinOps Quagmire


November 20, 2023. I woke up to another day of wrestling with our platform’s cost management systems—another day of trying to untangle the financial spaghetti that is FinOps in 2023.

I’ve been at this platform engineering gig for a while now, and it seems like every new month brings its own set of challenges. This week was particularly rough because we had just closed out another major billing cycle, and our costs were through the roof. The DevOps team and I spent hours poring over logs and metrics trying to figure out where all this money was going.

First off, let’s talk about OpenAI’s board firing Sam Altman. It feels like every day there’s a new twist in that drama, but honestly, it hasn’t directly impacted my work much beyond the occasional heated discussion on Slack. Still, it’s hard not to think about how these big tech moves ripple through the industry and affect our day-to-day.

On to something more practical. We have been using AWS for years, but as costs started creeping up, we decided to look into FinOps tools. The landscape is overwhelming, with options ranging from cloud provider dashboards to third-party services like Podium, Cost Management Tools by Google, and numerous open-source solutions like Cost-Insight.

We settled on a few options and set up an initial trial, but it quickly became clear that this was going to be harder than we thought. We had some early wins—like identifying our most expensive services—but the deeper we dug, the more issues we found. One of the biggest challenges was integrating these tools with our existing infrastructure, which is a mix of AWS, Kubernetes, and self-managed servers.

The other day, I spent hours trying to get one of our microservices running on AWS Fargate while also syncing its costs through the FinOps tool. It’s one thing to read about it in documentation, but when you’re actually doing it… well, let’s just say there were a few moments where I wanted to throw my laptop out the window.

But that’s what makes this job so rewarding, and honestly, so frustrating. We finally got something working, only to find another issue the next day. This cycle of debugging, refactoring, and retesting is exhausting but also deeply satisfying. Each small win feels like a victory in an otherwise daunting battle.

Another aspect of FinOps that’s been eating at me lately is the sheer amount of manual effort required to manage costs effectively. We have a solid pipeline for continuous integration and deployment (CI/CD), but none of that automation extends to cost optimization. We still spend a lot of time adjusting settings by hand, running one-off scripts, and tweaking configurations.

On a brighter note, I got to work with some new tools that might help us automate some of this process. One of them is the AWS Cost Explorer API, which lets us query cost data programmatically. We started building a simple script to automatically adjust the size of our Fargate tasks based on usage patterns. It’s still in its early stages, but it feels like progress.
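To give a rough idea of what querying cost data looks like, here’s a minimal sketch rather than our actual script. It assumes `boto3` is installed and credentials are configured for the live call. The function names (`fetch_daily_costs`, `total_by_service`, `top_services`) and the dollar amounts in `SAMPLE_RESPONSE` are made up for illustration. It also ignores pagination, which a real script would need to handle.

```python
from collections import defaultdict
from decimal import Decimal


def fetch_daily_costs(start, end):
    """Ask Cost Explorer for daily unblended cost, grouped by service.

    `start` and `end` are 'YYYY-MM-DD' strings (end is exclusive).
    Requires boto3 and AWS credentials; pagination is not handled here.
    """
    import boto3  # imported lazily so the helpers below run without it

    client = boto3.client("ce")
    return client.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


def total_by_service(response):
    """Sum the per-day amounts in a get_cost_and_usage response by service."""
    totals = defaultdict(Decimal)
    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            service = group["Keys"][0]
            # Amounts come back as strings, so parse them as Decimal.
            totals[service] += Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


def top_services(response, n=3):
    """Return the n most expensive services as (name, total) pairs."""
    totals = total_by_service(response)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical response in the shape Cost Explorer returns; numbers are invented.
SAMPLE_RESPONSE = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2023-11-18", "End": "2023-11-19"},
            "Groups": [
                {"Keys": ["Amazon Elastic Container Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "12.25", "Unit": "USD"}}},
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"UnblendedCost": {"Amount": "8.00", "Unit": "USD"}}},
            ],
        },
        {
            "TimePeriod": {"Start": "2023-11-19", "End": "2023-11-20"},
            "Groups": [
                {"Keys": ["Amazon Elastic Container Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "18.25", "Unit": "USD"}}},
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"UnblendedCost": {"Amount": "7.50", "Unit": "USD"}}},
            ],
        },
    ]
}

if __name__ == "__main__":
    for name, total in top_services(SAMPLE_RESPONSE):
        print(f"{name}: ${total}")
```

Keeping the aggregation separate from the API call means the ranking logic can be checked against a canned response before anything touches a real account.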

Speaking of progress, the DORA metrics are ingrained in my thinking now. Every time we hit another billing cycle with unexpected costs, I can’t help but think about how much room for improvement there is. We’re not hitting our deployment goals either, and that’s a clear sign we need to tighten up our processes.

One of the things I’ve come to realize more than ever is the importance of developer experience (DX). In trying to automate cost management, we’re often focused on efficiency at the expense of ease for our developers. We’re working on creating more intuitive interfaces and better documentation so that our teams can manage their costs without feeling like they’re jumping through hoops.

As I write this, the build server is churning away in the background, pushing yet another microservice to production. The clock ticks forward, and with it comes another round of cost reporting and analysis. It’s a never-ending cycle, but one that keeps me engaged and challenged every day.

In the end, it’s all about finding balance—balancing costs against service quality, automation against manual effort, and efficiency against developer experience. I’m not sure where this journey will take us, but I’m excited to see what the next few months bring.

Until then, keep debugging, keep optimizing, and most importantly, keep pushing boundaries.


P.S. For anyone else out there dealing with FinOps nightmares, know that you’re not alone. Hang in there!