$ cat post/june-2022-reflections-ai-infrastructure-overload-platform-engineering-norm-and-finops-reality.md
June 2022 Reflections: AI Infrastructure Overload, Platform Engineering Norm, and FinOps Reality
June 2022. A month that felt like the rules of tech were being rewritten right in front of us. AI infrastructure was exploding with every new GPT-3 and DALL-E 2 demo, platform engineering was becoming a recognized discipline within companies, and everyone seemed to be talking about FinOps, cloud cost pressure, and DORA metrics. This was my reality check-in.
On one hand, I had just led a team through the deployment of our new AI backend, which could handle twice as many requests with half the resources. We were using an emerging stack: Terraform Enterprise (TFE) to run our provisioning workflows, Pulumi for the application-level infrastructure as code, and Kubeflow for the MLOps pipeline. The excitement was palpable; this was cutting-edge stuff! The reality hit hard, though, when the cloud bills started climbing after a few days of heavy usage.
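For the curious, here's the flavor of the Pulumi side. This is a minimal sketch, not our actual configuration: the instance type, AMI ID, and tag values are all placeholders.

```python
"""Minimal Pulumi sketch of the kind of resource we were provisioning.
Instance type, AMI, and tags are illustrative placeholders."""
import pulumi
import pulumi_aws as aws

# A single GPU inference node; in reality this sat behind an
# autoscaling group, which is where the cost surprise came from.
inference_node = aws.ec2.Instance(
    "inference-node",
    instance_type="g4dn.xlarge",      # hypothetical GPU instance type
    ami="ami-0123456789abcdef0",      # placeholder AMI ID
    tags={
        "team": "ml-platform",        # tagging early is what makes FinOps possible later
        "cost-center": "ai-backend",
    },
)

pulumi.export("instance_id", inference_node.id)
```

That `tags` block turned out to matter more than anything else in the file, because without consistent tags you can't attribute a single dollar of the bill. More on that in a second.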
“Wait,” I thought, “we’ve got FinOps down pat.” That’s what our CTO had told us, right? But as I dug into the AWS bill, it was clear we had very little visibility into which workloads were actually driving the spend, and plenty of resources were sitting underutilized. We needed to get better at understanding and managing our cloud costs.
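The digging itself was mundane: pull daily spend grouped by service and look for spikes. Here's a rough sketch of the kind of script I ended up leaning on, assuming standard Cost Explorer access; the date range and the $100 threshold are arbitrary examples.

```python
"""Rough daily cost breakdown via AWS Cost Explorer (boto3)."""
import boto3

ce = boto3.client("ce")  # Cost Explorer

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-06-01", "End": "2022-06-30"},  # End is exclusive
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print daily spend per service so spikes stand out.
for day in resp["ResultsByTime"]:
    date = day["TimePeriod"]["Start"]
    for group in day["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount > 100:  # arbitrary threshold to cut noise
            print(f"{date}  {service:<40} ${amount:,.2f}")
```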
The conversation shifted from “How do we scale?” to “How much are we paying for this scaling?” It was a stark reminder of why FinOps had become such a buzzword—companies couldn’t afford to be wasteful anymore, especially in an era where every dollar could mean the difference between staying competitive or falling behind.
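Concretely, the question became one of unit economics: not “what's the bill?” but “what does a million requests cost us?” A back-of-the-envelope version, with made-up numbers purely for illustration:

```python
# Back-of-the-envelope unit economics; every number here is
# hypothetical, not from our actual bill.
daily_cost = 1_800.00        # hypothetical daily spend in USD
daily_requests = 12_000_000  # hypothetical request volume

cost_per_million = daily_cost / (daily_requests / 1_000_000)
print(f"${cost_per_million:.2f} per million requests")  # -> $150.00 per million requests
```

Once you track a number like that over time, “we scaled 2x” stops being automatically good news.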
Platform engineering also became more mainstream. At our company, platform teams were starting to form, and with them came new challenges. How do you build a scalable and maintainable infrastructure that supports multiple applications without becoming a single point of failure? It’s easy to say, “Just use Infrastructure as Code,” but the devil is in the details. We had to grapple with how to balance simplicity for developers with robustness and reliability.
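Our working compromise was a thin “golden path”: give developers one simple call whose defaults encode the platform team's opinions. Everything in this sketch is hypothetical (the helper, its defaults, the policy check), but it shows the shape of the trade-off:

```python
"""Sketch of a 'golden path' deploy helper; names, defaults, and the
underlying platform call are all hypothetical."""
from dataclasses import dataclass

@dataclass
class ServiceSpec:
    name: str
    image: str
    replicas: int = 2          # safe default: never a single point of failure
    cpu: str = "500m"
    memory: str = "512Mi"

def deploy(spec: ServiceSpec) -> None:
    """Validate the spec, then hand it to the platform (stubbed here)."""
    if spec.replicas < 2:
        raise ValueError("platform policy: services must run at least 2 replicas")
    # In a real system this would render manifests and apply them;
    # here we just show the shape of the call.
    print(f"deploying {spec.name} ({spec.image}) x{spec.replicas}")

deploy(ServiceSpec(name="recommender", image="registry.internal/recommender:1.4.2"))
```

The replicas check is the whole idea in miniature: the simple path is also the safe path, and anyone who needs to leave it has to do so deliberately.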
During one particularly heated meeting, we discussed whether to standardize on a specific cloud provider or allow engineers more flexibility to choose based on project needs. The argument for flexibility seemed compelling—why limit ourselves when different tools might fit better? But the counter-argument about vendor lock-in was just as strong. In the end, we agreed to stay neutral and let projects vote with their feet.
Meanwhile, I was wrestling with a tricky issue in one of our services that kept crashing under heavy load. The culprit turned out to be a server-side WebAssembly module. The code looked clean, but every few requests would die with what the logs reported as a segmentation fault, which in Wasm terms usually surfaces as an out-of-bounds memory trap. Debugging this meant poring over logs and trying to replicate the conditions that led to the crash. In the end, I realized we needed better testing tools for these kinds of modules.
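The tool I wished I'd had from day one was a replay harness: re-run recorded inputs against the module in isolation until it traps. Here's a minimal sketch using the wasmtime Python bindings; the module path, export name, and inputs are all hypothetical.

```python
"""Tiny replay harness for a server-side wasm module, using wasmtime-py.
The module path, export name, and inputs are placeholders."""
from wasmtime import Engine, Module, Store, Instance, Trap

engine = Engine()
module = Module.from_file(engine, "handler.wasm")  # placeholder path

# Re-run recorded inputs, each in a fresh store so state left over
# from one call can't poison the next.
for payload_size in [16, 1024, 65_536, 1_048_576]:
    store = Store(engine)
    instance = Instance(store, module, [])
    handle = instance.exports(store)["handle"]  # hypothetical export name
    try:
        handle(store, payload_size)
    except Trap as trap:
        print(f"input {payload_size} trapped: {trap}")
```

Isolating each call like this is what finally separated “the module has a bug” from “the module corrupts state across calls.”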
One day, while taking a break on Hacker News, I came across a post that stuck with me: “Show HN: A friend and I spent 6 years making a simulation game, finally released.” Reading it brought back memories of late-night coding sessions, debugging marathons, and the satisfaction of finally shipping something. It made me reflect on how much of my career was dedicated to solving problems and delivering value, even if it meant putting in long hours.
And then there was the Hertzbleed attack, another reminder of just how many vulnerabilities are out there waiting for someone to exploit them. Hertzbleed turns dynamic CPU frequency scaling into a timing side channel, meaning even carefully written constant-time cryptographic code can leak secrets. The thought kept nagging at the back of my mind: “What if our code has a similar vulnerability?” It’s these kinds of threats that keep us on our toes and make sure we don’t become complacent.
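Hertzbleed itself lives below the level application code can fix, but it was a good prompt to audit the basics. The classic application-level timing leak is comparing secrets with `==`, which short-circuits on the first differing byte; Python's standard library has a constant-time alternative:

```python
import hmac

def verify_token(supplied: str, expected: str) -> bool:
    # == returns as soon as bytes differ, leaking position via timing;
    # compare_digest takes the same time regardless of where they differ.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```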
As June drew to a close, I found myself reflecting on all this activity—AI infrastructure growth, platform engineering trends, FinOps realities. The tech landscape was evolving rapidly, and so were the challenges we faced as engineers. But one thing remained constant: solving problems day in and day out.
Until next time, keep debugging!