$ cat post/the-daemon-restarted-/-the-cluster-held-until-dawn-/-the-shell-recalls-it.md

27FEB23

the daemon restarted / the cluster held until dawn / the shell recalls it

Dealing with the Demands of DevOps in a FinOps World

February 27, 2023. I woke up to another notification about the latest AI milestone—this time it was Bing’s ethical statements. It made me wonder how these changes would impact our own AI projects here at [Company Name]. As we’ve been focusing on platform engineering, we’re not just keeping up with the tech but also trying to make our operations more efficient and cost-effective.

Last week, I had a tough discussion about FinOps within my team. The reality is that as engineers, we often focus so much on making systems work that we overlook what they cost the business. With DORA metrics now widely adopted for measuring delivery performance, there’s pressure not just to ship features quickly but to ship them cost-effectively.

One thing I’ve been wrestling with recently is our serverless architecture. While it’s great for reducing upfront infrastructure costs and making scaling easier, it can sometimes be a black box when it comes to monitoring and optimizing spending. Last month, we had a spike in AWS Lambda costs that took some digging to figure out. It turned out one of our microservices was being triggered way more than expected due to an issue with event sourcing.

I debugged the problem by adding detailed logging and then using CloudWatch metrics to track the service’s performance over time. Once I saw the data, it became clear that a specific flow was generating far too many events. We fixed the code, and the costs came down significantly. But the experience highlighted how much ongoing DevOps work it takes to keep spending visible, not just to keep everything running smoothly.
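For anyone chasing a similar spike, here is a minimal sketch of the kind of check that can surface it. It is not the code we ran. It assumes you have already exported hourly invocation counts for one function (for example, from the `Invocations` metric in the `AWS/Lambda` CloudWatch namespace) into a plain list. The function name, the window size, the threshold factor, and the sample numbers are all made up for illustration.

```python
from statistics import median
from typing import List, Tuple


def find_invocation_spikes(
    counts: List[int],
    window: int = 6,
    factor: float = 3.0,
) -> List[Tuple[int, int, float]]:
    """Flag points that exceed `factor` times the median of the previous `window` points.

    Returns a list of (index, count, baseline) tuples. The window and factor
    defaults are illustrative assumptions, not tuned values.
    """
    spikes = []
    for i in range(window, len(counts)):
        # Trailing median is less sensitive to a single outlier than a mean.
        baseline = median(counts[i - window:i])
        if baseline > 0 and counts[i] > factor * baseline:
            spikes.append((i, counts[i], baseline))
    return spikes


# Hypothetical hourly invocation counts for one Lambda function.
hourly_invocations = [120, 130, 110, 125, 118, 122, 127, 900, 950, 121]

for index, count, baseline in find_invocation_spikes(hourly_invocations):
    print(f"hour {index}: {count} invocations (trailing median {baseline})")
```

A trailing median is a deliberately crude baseline. It is enough to point you at the hours worth reading logs for, which is where the real diagnosis happens.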

Another area where we’re seeing pushback is platform engineering decisions. As we transition from traditional monolithic architectures to microservices and serverless, we’ve shifted towards a platform that abstracts away a lot of the complexity for developers. However, this also means developers are less familiar with the underlying infrastructure. We’ve had some debates about whether it’s worth investing time in helping them understand it better or whether we should just focus on building robust abstractions.

Personally, I’m torn. On one hand, I believe that understanding the platform helps developers make better decisions and avoid common pitfalls. On the other hand, I recognize the reality of developer bandwidth and the need to iterate quickly.

Today, as I sit in front of my screen writing this blog post, I can’t help but think about how much has changed in just a few months. AI is everywhere, WebAssembly is gaining traction on servers, and platform engineering is becoming mainstream. It’s overwhelming at times but also incredibly exciting.

For now, I’m focused on making our systems more resilient and cost-effective. There’s a lot of work to do, but I feel lucky to be part of an organization that values both innovation and pragmatism.

Stay tuned for updates and insights into the journey we’re all embarking on.

— Brandon