July 24, 2023 - AI Infrastructure Blues

July 24th, 2023. Sitting in my home office with the sun casting a warm glow through my window, I can’t help but reflect on how the tech world is shaping up this year. The AI landscape is exploding, and platform engineering has truly taken center stage. But like all good things, there are growing pains.

AI Infrastructure Explosion

Just last week, the launch of Llama 2 sent ripples through the industry. As a platform engineer, I find myself spending more time thinking about how to scale these large language models (LLMs) and ensure they can handle high loads without breaking our infrastructure. It’s a challenge that keeps me up at night: how do you serve a model with billions of parameters efficiently while keeping latency in check?

Platform Engineering Mainstream

Platform engineering is becoming mainstream, and it’s clear why. The ability to build and maintain robust, scalable platforms for teams across the organization is no longer just nice to have—it’s essential. At our company, we’ve been working on a new platform called “FlexPaaS” (flexible platform as a service). It’s designed to abstract away complexity for developers so they can focus on building cool stuff rather than managing infrastructure.

But with every new platform comes the challenge of making it bulletproof. I spent weeks debugging an issue where certain API calls would randomly fail. After many hours of tracing and profiling, we discovered it was due to a race condition in our cache layer. The problem was subtle enough that it took multiple rounds of A/B testing the fix against the old code path before we were confident it was fully resolved.
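The post does not say what our race actually looked like, so the following is only a minimal sketch of one common shape: two requests miss the cache at the same moment and both run the loader, and one of them can end up with stale or half-built data. The class name SingleLoadCache and the loader callback are hypothetical names chosen for illustration, not FlexPaaS code. The idea is a per-key lock plus a second check after acquiring it.

```python
import threading
from typing import Callable, Dict, Hashable


class SingleLoadCache:
    """In-memory cache whose loader runs at most once per key under concurrency."""

    def __init__(self) -> None:
        self._values: Dict[Hashable, object] = {}
        self._key_locks: Dict[Hashable, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def get(self, key: Hashable, loader: Callable[[Hashable], object]) -> object:
        # Fast path: return a cached value, or fetch the lock for this key.
        with self._guard:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())

        # Only one thread per key can be inside this block at a time.
        with key_lock:
            # Re-check: another thread may have filled the cache while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = loader(key)  # slow call runs outside the global guard
            with self._guard:
                self._values[key] = value
            return value
```

Without the second check inside the per-key lock, every waiting thread would call the loader again once it got the lock. This sketch never evicts entries or cleans up per-key locks, which a real cache layer would need to handle.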

FinOps and Cloud Cost Pressure

FinOps (a blend of “finance” and “DevOps”) is becoming more critical than ever as cloud costs continue to rise. We’ve been working closely with the finance team to set up cost centers, budgeting, and monitoring tools. One of our recent projects was building a custom dashboard using Grafana to give us real-time visibility into cloud spend across different teams. It’s eye-opening how much can be saved just by tweaking some configurations or moving workloads between regions.
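The kind of roll-up that sits behind a per-team spend panel can be sketched in a few lines. Everything here is an assumption made for illustration: the CostRow shape, the team and region names, and the budget figures are made up, and a real pipeline would read from the cloud provider’s billing export rather than a hard-coded list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class CostRow(NamedTuple):
    """One line of a (hypothetical) billing export."""
    team: str
    region: str
    usd: float


def spend_by_team(rows: Iterable[CostRow]) -> Dict[str, float]:
    """Sum spend per team across all regions."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row.team] += row.usd
    return dict(totals)


def teams_over_budget(totals: Dict[str, float], budgets: Dict[str, float]) -> List[str]:
    """Return teams whose spend exceeds their budget; teams with no budget are skipped."""
    return sorted(
        team for team, spent in totals.items()
        if team in budgets and spent > budgets[team]
    )


# Made-up sample data, not real figures.
rows = [
    CostRow("search", "us-east-1", 120.0),
    CostRow("search", "eu-west-1", 80.0),
    CostRow("ml", "us-east-1", 450.0),
]
totals = spend_by_team(rows)
print(totals)
print(teams_over_budget(totals, {"search": 250.0, "ml": 400.0}))
```

Grouping by region instead of team is the same loop with a different key, which is usually the first step when checking whether moving a workload would save anything.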

Developer Experience as Discipline

Developer experience (DX) is gaining traction as its own discipline, and it’s something I’ve been advocating for within the team. I recently argued for a more streamlined development environment setup process using Dev Containers in VS Code. It’s amazing how much friction we can reduce by standardizing our tools and workflows.

DORA Metrics

DORA (DevOps Research and Assessment) metrics are widely adopted now, and they’re forcing us to be more data-driven in our engineering practices. Our latest sprint retros looked specifically at lead time for changes, aiming to reduce the time between when developers commit a piece of work and when it’s actually deployed to production. We’ve seen some initial wins from CI/CD optimizations.
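The gap between finished work and a production deploy described above is what DORA calls lead time for changes, and it is simple to compute once you have two timestamps per change. The sketch below assumes each change is a (committed_at, deployed_at) pair. Where those timestamps come from (Git, the CI system, a deploy log) is left open, and the function name is hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Tuple


def median_lead_time(changes: Iterable[Tuple[datetime, datetime]]) -> timedelta:
    """Median of (deployed_at - committed_at) over a set of changes."""
    seconds = [
        (deployed_at - committed_at).total_seconds()
        for committed_at, deployed_at in changes
    ]
    if not seconds:
        raise ValueError("need at least one change to compute lead time")
    return timedelta(seconds=median(seconds))


# Made-up sample: three changes committed at the same time, deployed 1h, 3h, and 10h later.
start = datetime(2023, 7, 24, 9, 0)
sample = [
    (start, start + timedelta(hours=1)),
    (start, start + timedelta(hours=3)),
    (start, start + timedelta(hours=10)),
]
print(median_lead_time(sample))
```

The median is used here rather than the mean so that one change stuck in review for a week does not swamp the picture. Tracking a high percentile alongside it helps surface exactly those outliers.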

WebAssembly on Server Side

WebAssembly (Wasm) on the server side is an interesting trend. I’m curious about how we can leverage Wasm for edge computing tasks where performance really matters. One of our experiments involved running a simple image processing task using Wasm. It’s fast, and it works surprisingly well, but there are still kinks to work out around compatibility and security.

The Zenbleed Bug

Speaking of security, the Zenbleed bug (CVE-2023-20593) was disclosed this week. As I was debugging an issue in our monitoring stack, this hardware flaw in AMD’s Zen 2 processors popped up on my radar. Under certain speculative-execution conditions, it can leak register contents across processes running on the same core. It’s a stark reminder that no matter how much we test, there are always edge cases to consider.

Conclusion

So here I am, reflecting on another month of building, arguing, and learning. The tech world is moving fast, but I’m excited about the opportunities it presents. Whether it’s LLMs or Wasm, every challenge brings us closer to creating better systems for everyone.

Stay tuned as we continue to push the boundaries!
