Dealing with DORA Metrics in a Land of Overwhelming CNCF Choices
Today marks the 17th of June, 2024. As I sit down to reflect on this month, my mind is brimming with thoughts about DevOps and platform engineering. The past few weeks have been intense, particularly given the explosion of AI/LLM infrastructure work set off by tools like ChatGPT, and the ongoing conversations around FinOps and cloud cost pressure. Let’s dive into some of the real ops and infrastructure challenges I faced this month.
The Era of AI/LLM Infrastructure
This year, every conversation revolves around AI and LLMs. From ChatGPT to various other players entering the space, it’s clear that large language models are transforming how we think about software development and platform engineering. One day, during a team meeting, we started discussing how to integrate some of these new tools into our stack—specifically, how we could use AI for better code reviews or as an assistant in our daily DevOps tasks.
However, the reality hit us hard when one of my junior engineers came to me with a report about a misconfigured LLM that accidentally pushed sensitive customer data into production. That was quite a wake-up call! The pressure is on not just to keep up but to ensure that these new tools are used responsibly and securely.
FinOps and Cost Pressure
On the cost side, our finance team has been pushing hard for better transparency, and alongside the cost reports, leadership has started tracking the four DORA metrics: deployment frequency, lead time for changes, change failure rate, and mean time to restore service. These metrics are great in theory but can be a pain in practice—every push we make is now scrutinized more than ever before.
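For the record-keeping side of this, the four metrics are simple aggregates over deployment history. Here's a minimal sketch of how we compute them; the record shape (`committedAt`, `deployedAt`, `failed`, `restoredAt`, all timestamps in milliseconds) is a hypothetical one I'm using for illustration, not the schema of any real tool.

```javascript
// Sketch: the four DORA metrics from a list of deployment records.
// Record shape is hypothetical: { committedAt, deployedAt, failed, restoredAt }.
function doraMetrics(deployments, periodDays) {
  // Deployment frequency: deploys per day over the period.
  const frequency = deployments.length / periodDays;

  // Lead time for changes: hours from commit to deploy, averaged.
  const leadTimes = deployments.map(
    (d) => (d.deployedAt - d.committedAt) / 3600000
  );
  const meanLeadTime =
    leadTimes.reduce((a, b) => a + b, 0) / leadTimes.length;

  // Change failure rate: fraction of deploys that caused a failure.
  const failures = deployments.filter((d) => d.failed);
  const changeFailureRate = failures.length / deployments.length;

  // Mean time to restore: hours from failed deploy to recovery.
  const recoveryTimes = failures.map(
    (d) => (d.restoredAt - d.deployedAt) / 3600000
  );
  const mttr = recoveryTimes.length
    ? recoveryTimes.reduce((a, b) => a + b, 0) / recoveryTimes.length
    : 0;

  return { frequency, meanLeadTime, changeFailureRate, mttr };
}
```

The point is that none of this is exotic math—the pain is in collecting honest deployment records in the first place.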
One of our recent projects involved migrating a large microservices-based application from AWS Lambda to a more cost-effective solution like Google Cloud Functions. The initial plan was straightforward: just swap out the provider and hope for the best. But as soon as we started tracking DORA metrics, things got complicated. Every deployment now has a higher bar, requiring extensive testing and thorough documentation. It’s not just about getting stuff done; it’s about doing it efficiently and keeping our costs under control.
CNCF Landscape and Platform Engineering
The CNCF landscape is overwhelming right now. With every new project or service we start, there are so many options—Kubernetes, OpenTelemetry, Jaeger for tracing, Prometheus for monitoring, and the list goes on. It’s like trying to choose between all these delicious flavors of ice cream without knowing which one will pair best with your dessert.
One particular argument I had with a colleague was about observability tooling: whether plain Prometheus was enough, or whether we should layer Thanos on top of it for long-term storage. The discussion went back and forth for days, with pros and cons from every angle. In the end, we decided to do both—Prometheus as our primary monitoring system, with Thanos handling long-term retention and cross-cluster querying. It’s a complex decision, but sometimes you just need a little of everything.
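One thing that made the "both" decision easier: Prometheus and Thanos Query expose the same HTTP query API (`/api/v1/query`), so client code doesn't care which one it's talking to. A small sketch of that, with placeholder base URLs (the endpoints are the real Prometheus API; everything else here is illustrative):

```javascript
// Build an instant-query URL. Works against a Prometheus server or a
// Thanos Query endpoint, since both serve /api/v1/query.
function queryUrl(baseUrl, promql, unixTime) {
  const params = new URLSearchParams({
    query: promql,
    time: String(unixTime),
  });
  return `${baseUrl}/api/v1/query?${params}`;
}

// Extract [labels, numericValue] pairs from an instant-query response
// body ({ status, data: { resultType: "vector", result: [...] } }).
function parseVector(responseBody) {
  if (responseBody.status !== "success") {
    throw new Error(`query failed: ${responseBody.status}`);
  }
  return responseBody.data.result.map((r) => [r.metric, Number(r.value[1])]);
}
```

In practice we point the same helper at `prometheus:9090` for recent data and at the Thanos Query endpoint when we need months of retention.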
WebAssembly on Server-Side
On another front, there has been growing interest in running WebAssembly (Wasm) on the server side. The idea is to take code compiled from languages like Rust or C++ and run it outside the browser, inside a sandboxed Wasm runtime, which can significantly boost performance for certain tasks. My team and I started experimenting with this by building a small proof of concept that uses Wasm to offload some heavy computation from our Node.js application.
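The core of the handoff is small, since Node.js ships a `WebAssembly` API. Here's a minimal sketch; the module bytes below are a tiny hand-assembled Wasm binary exporting `add(i32, i32) -> i32`, standing in for what would really be a module compiled from Rust or C++.

```javascript
// A minimal hand-written Wasm module, section by section.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section header
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0/1, i32.add, end
]);

// Instantiate the module and call the exported function from JS.
async function offloadAdd(a, b) {
  const { instance } = await WebAssembly.instantiate(wasmBytes);
  return instance.exports.add(a, b);
}
```

In a real service you'd instantiate the module once at startup and reuse the instance, rather than paying the compile cost per call as this sketch does.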
The initial results were impressive, but as expected, there were plenty of challenges: compatibility gaps between the various Wasm runtimes, debugging difficulties, and the fact that few developers on the team are familiar with the Wasm toolchain yet—it’s been quite a learning curve. We’re still in the early stages, but we’re optimistic about its potential to improve our platform.
Conclusion
Looking back at June 2024, it feels like I’ve spent most of my time navigating these complex landscapes—balancing AI and LLM integration with FinOps metrics, deciding on the best CNCF tools, and experimenting with Wasm. It’s a lot to handle, but that’s what makes this job so rewarding. The constant learning curve is exhilarating.
As the month comes to an end, I find myself reflecting on how these challenges will shape our future work. Will we see more automation in deployment processes? How will AI change our roles as platform engineers? Only time will tell, but one thing is certain: this field is far from static, and I’m looking forward to whatever comes next.
Stay tuned for the next update!