$ cat post/make-install-complete-/-the-version-pinned-to-never-/-i-kept-the-old-box.md

make install complete / the version pinned to never / I kept the old box


Navigating the AI Labyrinth

April 8, 2024

Today, I find myself grappling with another chapter in our platform’s journey. The tech landscape is ever-evolving, and it feels like we’re navigating a maze where every turn brings new challenges and opportunities. This post is about the AI/ML infrastructure explosion that’s kept me busy over the past few months.

ChatGPT Aftermath

The aftermath of the ChatGPT hype has settled into a steady state, but the reverberations are still felt across the industry. We’ve seen a proliferation of new language models, each vying for attention and adoption. Our platform engineering team is no stranger to this storm; we’re currently working on integrating Meta’s Llama 3 into our ecosystem.

The challenges are daunting. Llama 3 brings incredible capabilities, but also real complexity in terms of infrastructure requirements. We’ve had several sleepless nights figuring out how to manage the compute and memory needs without breaking the bank. And now that we track DORA metrics, any infrastructure bottleneck shows up directly in our deployment frequency and lead time numbers.
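To put rough numbers on the memory side: a dense model’s weights alone need parameters × bytes-per-parameter, before you even count KV cache, activations, or runtime overhead. A back-of-the-envelope sketch (the 8B and 70B figures are Llama 3’s published parameter counts; everything else here is simplified — roughly 15 GiB at fp16 for the 8B model):

```rust
/// Rough VRAM needed just to hold a dense LLM's weights.
/// Ignores KV cache, activations, and runtime overhead entirely.
fn weights_gib(params: f64, bytes_per_param: f64) -> f64 {
    params * bytes_per_param / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // Llama 3 ships in 8B and 70B parameter sizes.
    for (name, params) in [("8B", 8.0e9), ("70B", 70.0e9)] {
        // fp16/bf16 is 2 bytes per parameter; int8 quantization is 1.
        println!(
            "{name}: ~{:.0} GiB fp16, ~{:.0} GiB int8",
            weights_gib(params, 2.0),
            weights_gib(params, 1.0)
        );
    }
}
```

The point of the exercise wasn’t precision; it was seeing immediately which models fit on a single GPU and which need sharding before any pricing discussion starts.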

Platform Engineering Mainstream

The rise of platform engineering has been a game-changer. It’s no longer just about building tools for developers; it’s about creating an environment where they can be as productive and innovative as possible. We’re seeing more teams adopt this approach, which means we need to evolve our practices accordingly.

One recent argument in the team was about whether to build or buy infrastructure services. I argued that while off-the-shelf solutions are tempting due to their ease of use and integration, they often come with hidden costs and limitations. In the end, we decided to take a hybrid approach—leveraging some managed services for common needs but maintaining control over critical components.

WebAssembly on Server Side

WebAssembly (Wasm) is another area where I’ve been diving deep. The ability to run compiled code at near-native speed in the browser has opened up exciting possibilities. But when it comes to running Wasm on the server side, there are still a lot of questions around performance and security. We’re exploring runtimes like Deno and wasmCloud to see if they can provide a robust solution for our use cases.

One recent project involved building an experimental service that needed real-time processing. We initially considered Node.js with Wasm modules, but the latency was too high. After some back-and-forth, we settled on Rust compiled to Wasm, which got us the performance we needed. The learning curve was steep, but it paid off in the end.
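For flavor, here’s the shape of the Rust side. This is a minimal sketch, not our actual service: the function below stands in for the real processing logic, and it’s exported with a C ABI so a Wasm host can call it after compiling for the `wasm32-unknown-unknown` target.

```rust
// Compile for Wasm with:
//   cargo build --target wasm32-unknown-unknown --release
// The EWMA below is a placeholder for our real hot-path logic.

/// Exponentially weighted moving average over a stream of samples:
/// cheap, allocation-free, deterministic -- the kind of kernel that
/// runs well inside a Wasm sandbox.
#[no_mangle]
pub extern "C" fn ewma(prev: f64, sample: f64, alpha: f64) -> f64 {
    alpha * sample + (1.0 - alpha) * prev
}

fn main() {
    // Native usage example; a Wasm host would call the export directly.
    let mut avg = 0.0;
    for sample in [10.0, 8.0, 12.0] {
        avg = ewma(avg, sample, 0.5);
    }
    println!("smoothed: {avg}");
}
```

Keeping the boundary to plain numeric arguments is deliberate: crossing the host/Wasm boundary with complex types is where most of the latency and tooling pain lives.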

FinOps and Cloud Cost Pressure

FinOps—financial operations—is a discipline that’s becoming increasingly important as cloud costs continue to rise. We’re implementing cost controls and monitoring tools to ensure that we’re not overspending on infrastructure. This involves setting up detailed budgets, optimizing resource usage, and regularly reviewing our spending patterns.

The other day, I spent several hours debugging an unexpected spike in costs that turned out to be due to a misconfigured autoscaler. It was a humbling experience, reminding me that even experienced engineers can make mistakes. After fixing it, we implemented additional checks to catch similar issues in the future.
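One of the checks we added is conceptually simple: clamp whatever replica count the autoscaler asks for to a ceiling derived from the hourly budget. A hedged sketch of the idea — the function name and the cost figures are illustrative, not our real configuration:

```rust
/// Clamp an autoscaler's desired replica count so projected hourly
/// spend can't blow past the budget. Numbers here are made up.
fn capped_replicas(desired: u32, cost_per_replica_hr: f64, budget_hr: f64) -> u32 {
    let max_affordable = (budget_hr / cost_per_replica_hr).floor() as u32;
    // Never scale below one replica, never above what the budget allows.
    desired.min(max_affordable).max(1)
}

fn main() {
    // A runaway scaler asking for 50 replicas at $0.50/hr each,
    // against a $10/hr budget, gets capped at 20.
    println!("{}", capped_replicas(50, 0.5, 10.0));
}
```

The real guard also alerts when the cap actually bites, since silently throttling capacity is its own kind of incident.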

Developer Experience

Developer experience (DX) is another area where I’ve been focusing a lot of my efforts. With platform engineering becoming more mainstream, DX has become table stakes for success. We’re working on improving our CI/CD pipelines and documentation to make it easier for developers to onboard and contribute effectively.

One recent change we made was adding better support for GitOps practices. This involved integrating tools like Helm and Flux into our infrastructure. It wasn’t without its hiccups—initially, the learning curve was steep, but over time, we’ve seen a noticeable improvement in how quickly new team members can get up to speed.
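For anyone unfamiliar with the workflow, most of it revolves around declarative manifests like the trimmed-down Flux `HelmRelease` below. The names, namespace, and chart version are placeholders, not our actual config — just the general shape of what Flux reconciles from Git:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: example-service     # placeholder name
  namespace: platform
spec:
  interval: 5m              # how often Flux reconciles the release
  chart:
    spec:
      chart: example-service
      version: "1.2.3"      # placeholder chart version
      sourceRef:
        kind: HelmRepository
        name: internal-charts
```

The win is that a new team member’s first change is a one-line version bump in Git rather than a hand-run `helm upgrade` against production.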

Conclusion

As I sit here reflecting on the past few months, it feels like we’re at an inflection point. The tech landscape is evolving rapidly, and staying ahead of these changes requires constant vigilance and adaptability. Whether it’s dealing with the intricacies of AI/ML infrastructure or improving developer experience, every day brings new challenges.

For now, I’m content to keep pushing forward, learning from each setback and celebrating each success. The journey continues, and I can’t wait to see where it takes us next.