yaml indent wrong / I typed it and watched it burn / I pushed and forgot
Title: When DevOps Meets FinOps: Navigating the Cost Quagmire
October 23, 2023. It’s been a month of intense discussions about cost optimization in our infrastructure. The tech world is abuzz with AI and LLMs, but for us ops folks, it’s all about keeping costs under control without compromising performance.
Last week, we had an interesting chat with one of the FinOps teams. They were pushing us to find ways to reduce cloud spending by 20%. While I love my DevOps role, I’m not a huge fan of being called out for overspending every month in the name of innovation. But hey, when you’re responsible for running a platform that serves millions of users, cost control is non-negotiable.
One of the challenges we faced was identifying redundant resources. Our stack has grown organically over years—pods here, services there—and it’s hard to see the whole picture. We decided to bring in the cloud providers’ own cost tooling, AWS Cost Explorer and Google Cloud Billing, to get a better view of our spend. These APIs give great initial snapshots but quickly become overwhelming once you start drilling down into specific costs by service and region.
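To make that drill-down less overwhelming, the trick that worked for us was rolling the grouped results back up before staring at them. Here’s a minimal sketch in Python: the `results` list mimics the shape of a Cost Explorer response grouped by service and region (the service names and dollar amounts are made up for illustration), and we just sum per service first.

```python
from collections import defaultdict

# Sample data shaped like a cost report grouped by [service, region].
# These numbers are illustrative, not real billing data.
results = [
    {"Keys": ["AmazonEC2", "us-east-1"], "Amount": 1240.50},
    {"Keys": ["AmazonEC2", "eu-west-1"], "Amount": 310.20},
    {"Keys": ["AmazonS3", "us-east-1"], "Amount": 95.75},
]

def totals_by_service(groups):
    """Sum costs per service across all regions."""
    totals = defaultdict(float)
    for g in groups:
        service, _region = g["Keys"]
        totals[service] += g["Amount"]
    return dict(totals)

print(totals_by_service(results))
```

Once the per-service totals flag the big spenders, drilling into regions for just those services is much more manageable than eyeballing the full service-by-region matrix.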
Another issue was with autoscaling policies that were over-provisioned. We had been using them to handle spikes in traffic, but now they seemed like the culprit for unnecessary cloud spending. I spent a good portion of my day last week tweaking these policies, trying different configurations, and setting up alerts to catch any potential issues before they spiraled out of control.
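The kind of tweaking I mean looks roughly like this. A hypothetical HPA for one of our services (the name and numbers are illustrative, not our real config): lower the idle floor, raise the CPU target so pods run hotter before scaling, and add a scale-down stabilization window so we don’t flap after a spike.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2            # was 6 -- idle headroom we were paying for
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # was 40 -- scale later, run hotter
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # hold replicas for 5 min after a spike
```

Each change like this got an alert alongside it, so if latency crept up after running hotter we’d hear about it before users did.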
During one of our stand-ups, the topic came up about using serverless architectures more widely. It sounds appealing—pay only for what you use, no need to manage servers—but in practice, it can be tricky. We argued that while serverless might save us money on idle time, the cold start latency could hurt performance and user experience. Plus, serverless still requires a lot of setup and monitoring, which isn’t necessarily cheaper.
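The “pay only for what you use” argument really comes down to a break-even calculation. A back-of-the-envelope sketch, with entirely illustrative prices (one always-on small instance vs. a rough per-million-invocations serverless rate, neither taken from a real price sheet):

```python
# Illustrative monthly prices -- not real quotes from any provider.
ALWAYS_ON_MONTHLY = 60.0             # one small instance running 24/7
COST_PER_MILLION_INVOCATIONS = 4.0   # request charge + compute, rough

def breakeven_invocations(always_on, per_million):
    """Monthly invocation count at which serverless stops being cheaper."""
    return always_on / per_million * 1_000_000

print(breakeven_invocations(ALWAYS_ON_MONTHLY, COST_PER_MILLION_INVOCATIONS))
# → 15000000.0
```

Below that volume serverless wins on raw cost; above it, the always-on box does. And that’s before pricing in cold-start latency and the extra monitoring setup, which is exactly why the stand-up didn’t end with a clean answer.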
On the bright side, we finally shipped an internal tool for better developer experience (DX). It’s a simple dashboard that provides visibility into application health and resource usage. Developers can use it to quickly identify bottlenecks or memory leaks in their applications without having to dig through logs. This has been a huge win because it empowers them to make informed decisions about optimization, and we’re seeing fewer critical issues hitting production.
The FinOps team was pleased with the progress, but they still had some concerns about long-term sustainability. I reminded myself that while I can tweak policies and ship tools, ultimately, the key is in educating our developers about best practices for cost management. We’ve started running workshops to teach them how to use resources efficiently, set up appropriate alerts, and understand the impact of their code on the environment.
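One concrete artifact from those workshops is a starter alert rule. A sketch of the kind of Prometheus rule we walk developers through (the group name and thresholds are illustrative): fire a warning when a container has been sitting above 90% of its memory limit for ten minutes, which is usually either a leak or a limit that needs revisiting.

```yaml
groups:
  - name: cost-and-capacity    # illustrative group name
    rules:
      - alert: ContainerNearMemoryLimit
        expr: |
          container_memory_working_set_bytes
            / container_spec_memory_limit_bytes > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.container }} is close to its memory limit"
```

The point of teaching the rule rather than just shipping it is that developers then tune the threshold per service instead of treating every alert as ops noise.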
This month has been a mix of frustration—like when I debugged an issue that was actually caused by a misconfiguration in our Kubernetes cluster—and moments of relief—like when we successfully reduced costs by tweaking policies. But it’s clear that cost management is going to be a constant battle, and one that requires a multifaceted approach.
So here’s to hoping the next month brings even more insights and better tools for us ops folks. And maybe, just maybe, I can finally convince everyone that sometimes, DevOps and FinOps don’t have to be at odds.
This post captures some of the real challenges and discussions behind managing costs while keeping up with the latest tech: the daily struggles and small wins of running cloud infrastructure.