$ cat post/packet-loss-at-dawn-/-the-version-pinned-to-never-/-the-build-artifact.md

packet loss at dawn / the version pinned to never / the build artifact


Dealing with DevOps Drama in a World of LLMs


September 2, 2024. The air is thick with the buzz of AI and the hum of servers. Like last year, this one has brought a few hiccups and a lot of learning. One recent day I found myself debugging something that looked trivial yet kept gnawing at me: a simple Lambda function that had suddenly stopped working.

The Setup

We’ve been ramping up our use of serverless functions with AWS Lambda, part of the ongoing trend towards platform engineering. This particular function was serving as a bridge between our internal messaging system and a third-party service. It’s not rocket science, just some middleware to pass along messages. But it had stopped working.

The Initial Hunch

The first thing I did was look at the logs. They were unhelpfully silent: nothing in CloudWatch pointed to timeouts or errors. That made it interesting, because this wasn’t brand-new code but something that had been running smoothly for months.

The Investigation

I dove into the codebase, which was written in Python and deployed using AWS SAM (Serverless Application Model). I went through each line, but nothing seemed out of place. The dependencies were up to date; the service calls looked straightforward enough. But then I remembered something—a recent change we made to how we handle environment variables.

In our rush to standardize on a new configuration management tool, we had updated the way environment variables were loaded in our Lambda functions. Could it be that? I double-checked and triple-checked my assumptions—surely this couldn’t have broken everything, right?

The Lightbulb Moment

It was one of those moments where the obvious finally lands. In the hurry to migrate, we hadn’t documented the change properly or, more importantly, tested it thoroughly. The culprit was an environment variable that was no longer being set correctly, because of a subtle change in how we sourced it during that migration.

The Fix

The fix was anticlimactic: we restored the correct environment variable and added a test case covering this scenario to our suite. But it left me thinking about the broader implications.
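
For the curious, here is a minimal sketch of the kind of guard and regression test I have in mind. It is not our actual code. The variable names (THIRD_PARTY_API_URL, THIRD_PARTY_API_KEY), the MissingConfigError class, and the handler body are all made up for illustration. The idea is simply to check required environment variables up front and raise, so that a missing value shows up as an error instead of silence.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is absent or empty."""


# Hypothetical names: the real variable involved is not shown in this post.
REQUIRED_VARS = ("THIRD_PARTY_API_URL", "THIRD_PARTY_API_KEY")


def load_config(environ=None):
    """Return the required settings, or raise naming every missing one."""
    env = os.environ if environ is None else environ
    # Treat unset, empty, and whitespace-only values as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise MissingConfigError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in REQUIRED_VARS}


def handler(event, context, environ=None):
    """Lambda-style entry point that validates config before doing any work."""
    config = load_config(environ)
    # Forwarding the message to the third-party service is omitted here.
    return {"forwarded": True, "target": config["THIRD_PARTY_API_URL"]}


def test_missing_variable_fails_loudly():
    """Regression test: a dropped variable must raise, not pass silently."""
    incomplete = {"THIRD_PARTY_API_URL": "https://example.invalid"}
    try:
        handler({}, None, environ=incomplete)
    except MissingConfigError as exc:
        assert "THIRD_PARTY_API_KEY" in str(exc)
    else:
        raise AssertionError("expected MissingConfigError")
```

Passing the environment in as a plain dict keeps the test independent of whatever happens to be set on the machine running it. In a deployed function you would call load_config() with no arguments so it reads os.environ.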

FinOps and Cloud Cost Pressure

With the rise of AI, everyone wants to be smarter, faster, and more efficient. But with that comes the pressure to optimize costs. In my experience, FinOps (cloud financial operations) teams are getting much more involved in platform decisions, because every penny counts when you’re running thousands of Lambda functions.

Developer Experience as a Discipline

As we continue to push towards developer-friendly infrastructure, I’ve been thinking a lot about how to streamline the onboarding process for new team members. We’ve got CI/CD pipelines set up and a robust documentation system, but there’s always room for improvement. Maybe it’s time to formalize our approach to DevOps best practices.

WebAssembly on Server-Side

There’s been some talk about running WebAssembly (Wasm) on the server side. While I’m skeptical of the hype, Wasm does offer interesting possibilities: compiled, sandboxed code that can start quickly and may run with less overhead than an interpreted runtime such as CPython. It could be a game-changer if we can figure out how to use it without adding complexity.

Platform Engineering

Platform engineering has become mainstream, and with good reason. We’re not just deploying code anymore; we’re building ecosystems that support multiple teams across the organization. But this comes at a cost—more moving parts mean more things to go wrong. The key is finding the right balance between abstraction and simplicity.

Reflections

In the end, it was just another day in tech. We debugged something simple but learned valuable lessons along the way. The world of AI and platform engineering keeps throwing us curveballs, but at least we’re getting better at handling them.

So here’s to another round of ops drama—and a little more knowledge under our belts.


That’s it for today. Back to the trenches!