$ cat post/august-reflections:-from-m1-gpus-to-chatgpt's-aftermath.md
August Reflections: From M1 GPUs to ChatGPT's Aftermath
August 28, 2023
Just another day at the office… or should I say, another day dealing with the fallout from the AI/LLM infrastructure explosion that followed ChatGPT’s launch. As someone who’s spent a career navigating the ups and downs of tech trends, it feels like everything is moving so fast that sometimes I’m not sure where to look first.
The M1 GPU Driver
A while back, we started testing out the first conformant M1 GPU driver for our platform infrastructure. Let me be clear: this wasn’t an easy task. Apple’s M1 silicon was a leap into the unknown, and ensuring that our applications ran smoothly on it took some serious debugging. We hit all sorts of issues with rendering artifacts, performance bottlenecks, and even a few cases where the GPU just wouldn’t power up correctly.
Debugging these kinds of problems is always a challenge, but one thing I’ve learned over the years is that you can’t rush perfection. It’s about iterating and learning from every failure until everything works as it should. We spent countless nights in our data center, staring at cryptic kernel logs, trying to figure out why certain shaders weren’t working.
FinOps and Cloud Cost Pressure
Speaking of challenges, we’re dealing with increasing pressure from FinOps (financial operations) teams. It’s no longer just about building cool stuff; it’s about doing so in a way that doesn’t break the bank. Every cloud instance, every bit of infrastructure, needs to be justified not just by its utility but also by its cost. DORA (DevOps Research and Assessment) metrics are now mainstream, and we’re constantly being evaluated on our ability to ship features quickly while keeping costs under control.
One day, I had an argument with a FinOps colleague about the cost savings of spot instances versus dedicated hosts. He was convinced that spot instances were too risky; I argued that with proper monitoring and failover mechanisms in place, they could save us significant money without compromising uptime. In the end, we met in the middle with a hybrid model: critical components stayed on dedicated hosts, while the more interruption-tolerant parts ran on spot instances.
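To make the argument concrete, here’s a back-of-the-envelope version of the math we went through. All the numbers (prices, discount rate, fleet size, and the 60/40 split) are hypothetical, illustrative figures — real spot discounts and interruption rates vary by provider, region, and instance type.

```python
# Rough cost model for a hybrid spot/dedicated fleet.
# Every constant below is an assumption for illustration only.

ON_DEMAND_HOURLY = 1.00   # assumed dedicated/on-demand price per instance-hour
SPOT_DISCOUNT = 0.70      # assume spot runs ~70% cheaper than on-demand
INSTANCES = 20            # hypothetical fleet size
HOURS_PER_MONTH = 730

def monthly_cost(spot_fraction: float) -> float:
    """Monthly fleet cost when `spot_fraction` of instances run on spot."""
    spot = INSTANCES * spot_fraction * HOURS_PER_MONTH \
        * ON_DEMAND_HOURLY * (1 - SPOT_DISCOUNT)
    dedicated = INSTANCES * (1 - spot_fraction) \
        * HOURS_PER_MONTH * ON_DEMAND_HOURLY
    return spot + dedicated

all_dedicated = monthly_cost(0.0)
hybrid = monthly_cost(0.6)  # critical 40% stays on dedicated hosts
print(f"all dedicated: ${all_dedicated:,.0f}/mo, hybrid: ${hybrid:,.0f}/mo")
print(f"savings: {100 * (1 - hybrid / all_dedicated):.0f}%")
```

Under these made-up numbers the hybrid split cuts the bill by roughly 40% — which is the shape of the argument, even if your real discount and risk tolerance land the split somewhere else.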
The ChatGPT Aftermath
ChatGPT’s launch was like pouring fuel on an already roaring fire. Suddenly, everyone wanted to jump on the AI bandwagon. We had to quickly scale our infrastructure to handle an influx of requests that were both unpredictable and resource-intensive. It wasn’t just about setting up more servers; we needed to rethink how we managed memory, network I/O, and even storage.
We started using WebAssembly (Wasm) on the server side for some of these AI-related tasks. Wasm earned its reputation as a way to run sandboxed, near-native code in the browser, but there’s no reason to confine it there. On the backend, it let us offload some heavy lifting from our main application servers and reduce overall latency. We also containerized our services more aggressively, using tools like Podman and CRI-O to manage containers dynamically.
Developer Experience
On a lighter note, developer experience has become a serious discipline in itself. It’s not just about writing clean code; it’s about making sure the entire development lifecycle is smooth. We’re using platforms like OpenTF (which recently announced a fork of Terraform) to standardize our infrastructure as code practices. The idea is that everyone should be able to provision and manage resources with minimal friction.
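For a flavor of what “minimal friction” means in practice, here’s an illustrative configuration fragment in the Terraform/OpenTF language. The module source, variable names, and replica counts are all hypothetical — the point is that an engineer provisions an environment from one declarative entry point rather than hand-crafted cloud consoles.

```hcl
# Illustrative only: module path and variables are hypothetical.
terraform {
  required_version = ">= 1.5"
}

variable "environment" {
  type = string
}

module "service" {
  source      = "./modules/web-service"  # hypothetical internal module
  environment = var.environment
  replicas    = var.environment == "prod" ? 6 : 2
}
```

The win is less about any one resource block and more about review: infrastructure changes show up as diffs in pull requests, just like application code.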
We’ve also been investing in better CI/CD pipelines, using tools like GitHub Actions and GitLab CI. These have helped us automate deployments, testing, and even security scans. It’s a never-ending process of tweaking and optimizing these workflows, but the payoff is that developers can focus more on writing code rather than dealing with infrastructure minutiae.
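As a sketch of the kind of pipeline described above, here’s a minimal GitHub Actions workflow. The job layout and the `make` targets are placeholders, not our exact setup — a real pipeline would wire in your actual test runner and security scanner.

```yaml
# Illustrative CI workflow: job names and make targets are hypothetical.
name: ci
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run tests
        run: make test            # hypothetical Makefile target

  security-scan:
    needs: test                   # only scan once tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Dependency audit
        run: make audit           # placeholder for a real scanner step
```

The `needs:` edge is where the automation pays off: deployments and scans gate on green tests without anyone babysitting the order of operations.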
Wrapping Up
August 2023 was a whirlwind of changes, from M1 GPU drivers to AI scaling challenges. But through it all, one thing remains constant: the importance of staying agile and continuously learning. The tech landscape is always shifting, but by embracing new tools and methodologies, we can keep moving forward.
Until next time,
Brandon