$ cat post/the-pager-went-off-/-we-patched-it-and-moved-along-/-uptime-was-the-proof.md

the pager went off / we patched it and moved along / uptime was the proof


Title: Platform Engineering in the Era of AI and FinOps


February 26, 2024 feels like a good day to reflect on where we are in tech. Lately my work has been shaped by three things at once: the AI/LLM explosion, platform engineering becoming mainstream, and FinOps pressure on cloud spend. Here’s what it’s like being in this moment.

The LLMs Are Here to Stay

Since ChatGPT hit the scene, the AI landscape has transformed almost overnight. My team is now grappling with how to scale our infrastructure to support a growing number of LLM integrations and use cases. We’re not just running models; we’ve got entire pipelines for training, validation, and inference. It’s exciting but also challenging, especially when each model iteration can consume thousands of dollars in compute resources.

One recent debug session was particularly illuminating. Our latest model had a habit of going into infinite loops during inference, which was strange because it worked fine on our training clusters. After digging through the logs and profiling data, I realized that the issue lay not with the code itself but with how we were managing concurrency in our Kubernetes pods. It’s a reminder that even in an era of sophisticated AI tools, the basics still matter.
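
As a generic illustration (not the actual fix from that incident), here is a minimal sketch of one of those basics: capping how many inference requests a single process handles at once, and putting a deadline on each so a stuck call fails loudly instead of spinning forever. Everything named here (`MAX_IN_FLIGHT`, `REQUEST_TIMEOUT_S`, `fake_infer`) is a hypothetical placeholder, and the numbers are made up.

```python
import asyncio

MAX_IN_FLIGHT = 4        # hypothetical per-process concurrency cap
REQUEST_TIMEOUT_S = 2.0  # hypothetical per-request deadline, in seconds


async def fake_infer(prompt: str) -> str:
    """Stand-in for a real model call: waits briefly, then echoes in upper case."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def bounded_infer(sem: asyncio.Semaphore, prompt: str, stats: dict) -> str:
    """Run one request under the shared semaphore, with a timeout."""
    async with sem:  # blocks here once MAX_IN_FLIGHT requests are active
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        try:
            # wait_for cancels the call and raises TimeoutError past the deadline
            return await asyncio.wait_for(fake_infer(prompt), REQUEST_TIMEOUT_S)
        finally:
            stats["in_flight"] -= 1


async def run_batch(prompts):
    """Process all prompts; return the results and the peak concurrency observed."""
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(bounded_infer(sem, p, stats) for p in prompts)
    )
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(run_batch([f"prompt-{i}" for i in range(20)]))
    print(f"{len(results)} results, peak concurrency {peak}")
```

The cap lives in the application rather than in the pod spec, which is a design choice of this sketch: it keeps behavior predictable no matter how many requests the pod is sent.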

Platform Engineering as a Discipline

Speaking of basics, platform engineering has truly come into its own this year. We’re now a recognized team within the organization, separate from the app developers but closely aligned with their needs. Our mandate is to provide reliable infrastructure and tooling that allows everyone to innovate without worrying about the underlying complexity.

Recently, we’ve been experimenting with WebAssembly on the server side for certain microservices. It’s early days, but I’m excited by the potential it offers in terms of security and performance. The challenge now is making sure our platform engineers are well-versed enough in this new tech to not only implement it but also troubleshoot issues that might arise.

FinOps and Cloud Cost Pressure

The other major factor driving my work right now is FinOps, short for financial operations: the practice of making engineering teams accountable for what they spend in the cloud. Even with cloud providers offering increasingly sophisticated billing tools, the pressure to manage costs effectively has never been higher. A few weeks ago, I received a bill from Netlify for over $104k for hosting a simple static site. The sheer volume of charges was mind-boggling.

After some digging, we discovered that a few rogue team members had enabled some expensive features and didn’t understand the full cost implications. This incident underscored the importance of better cost governance practices across the organization. We’re now rolling out stricter monitoring tools and working closely with finance to ensure everyone has visibility into their cloud spend.
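
To make the monitoring idea concrete, here is a minimal sketch of the kind of check I have in mind: compare each service’s month-to-date spend against its budget and surface anything at or past a warning threshold. The `SpendRecord` shape, the `flag_overspend` helper, the 80% threshold, and the sample figures are all assumptions for illustration; real billing exports look different for every provider.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SpendRecord:
    """Hypothetical shape for one row of a billing export."""
    team: str
    service: str
    month_to_date_usd: float
    monthly_budget_usd: float


def flag_overspend(
    records: Iterable[SpendRecord], warn_ratio: float = 0.8
) -> List[Tuple[SpendRecord, float]]:
    """Return (record, spend/budget ratio) pairs at or above warn_ratio.

    Results are sorted with the worst offenders first. Services with no
    budget set are always flagged, since unbudgeted spend is the riskiest.
    """
    flagged = []
    for rec in records:
        if rec.monthly_budget_usd <= 0:
            ratio = float("inf")  # no budget defined: always surface it
        else:
            ratio = rec.month_to_date_usd / rec.monthly_budget_usd
        if ratio >= warn_ratio:
            flagged.append((rec, ratio))
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return flagged


if __name__ == "__main__":
    # Made-up sample data, not real figures.
    sample = [
        SpendRecord("web", "static-hosting", 950.0, 1000.0),
        SpendRecord("ml", "gpu-training", 4000.0, 10000.0),
        SpendRecord("data", "warehouse", 1500.0, 1000.0),
    ]
    for rec, ratio in flag_overspend(sample):
        print(f"{rec.team}/{rec.service}: {ratio:.0%} of budget")
```

A check like this is only useful if it runs daily and lands somewhere people actually look, such as a team channel, rather than in a dashboard nobody opens.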

The Developer Experience

Developer experience (DX) is another area that’s gained a lot of traction in recent years. It’s not just about making tools easier to use; it’s about creating an environment where developers can focus on building value rather than firefighting infrastructure issues. We’ve been experimenting with CI/CD pipelines built on GitHub Actions and with serverless functions for quick deployments.

One particularly frustrating day, I spent hours trying to resolve a bug in our internal build pipeline. Turns out, it was due to a simple misconfiguration that caused the entire job to fail. While this might seem trivial, it’s moments like these that remind me how much can go wrong when you’re not paying close attention.

Wrapping Up

So there you have it: a snapshot of platform engineering in 2024. It’s an exciting time, with plenty of opportunities and challenges, from AI infrastructure to FinOps. For now, I’m just trying to keep up and make sure our team stays ahead of the curve.

Until next time, Brandon

