$ cat post/kubernetes-complexity-fatigue:-my-fight-against-configuration-hell.md
Kubernetes Complexity Fatigue: My Fight Against Configuration Hell
May 3, 2021. I woke up to yet another morning of wrestling with Kubernetes configuration files. I’m sure I’m not alone in feeling this way, given the current state of affairs.
The last few months have been a rollercoaster as we’ve moved more and more services onto our Kubernetes platform. It’s amazing what you can do with it: deployments, rollouts, all that jazz. But let me tell you, managing these things gets really tedious once you get into production work.
I was trying to deploy a new service today, but the kustomize configurations and overlays were giving me fits. I found myself constantly checking the documentation, looking for that one elusive tweak that would make everything just work as intended. It’s like a game of whack-a-mole where every time you think you’ve solved one issue, three more pop up.
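For anyone who hasn’t wrestled with this firsthand: kustomize splits things into a shared base and per-environment overlays. Here’s a minimal sketch of the layout, with made-up paths and a hypothetical service name:

```yaml
# base/kustomization.yaml -- the shared manifests every environment starts from
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
```

```yaml
# overlays/production/kustomization.yaml -- production-only tweaks layered on top
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patchesStrategicMerge:
  - replica-count.yaml  # a small patch that bumps replicas for prod
```

Then `kubectl apply -k overlays/production` renders the base plus the patches. Simple enough on paper; the pain starts when the overlays and patches pile up across a dozen environments.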
One of my coworkers mentioned Backstage, which is all the rage right now. It’s great for providing internal developer portals and visualizing our infrastructure in a friendly way, but I started wondering if it could help with some of this Kubernetes pain. Maybe we can create a centralized dashboard to manage our deployments, rollouts, and services. The idea was tantalizing—reduce the complexity by standardizing everything.
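From what I’ve read so far, getting a service into the Backstage catalog mostly means dropping a catalog-info.yaml into its repo; something like this (the service and team names here are made up, not our real setup):

```yaml
# catalog-info.yaml -- lives in the service's repo; Backstage discovers it
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api  # hypothetical service name
  annotations:
    backstage.io/kubernetes-id: payments-api  # links the entry to its k8s resources
spec:
  type: service
  lifecycle: production
  owner: platform-team  # hypothetical owning team
```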
I began tinkering around with Backstage, trying to integrate it into our existing workflows. It was a good start, but I quickly hit some walls. For one thing, our team is pretty small, so the learning curve for setting up Backstage seemed steep. Plus, while it’s amazing for providing an overview of what’s happening in our infrastructure, it doesn’t help much with the day-to-day grunt work.
I also started reading about Argo CD and Flux, two GitOps tools that are gaining traction. They seem like they might be a better fit, not because they make the YAML go away, but because they make Git the single source of truth for it: you commit changes to a repo, and an automated controller applies them to the cluster and keeps it in sync. That’s a game changer if it works as advertised.
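To make that concrete, here’s roughly what an Argo CD Application looks like (the repo URL and names are placeholders): you declare which Git path maps to which cluster namespace, and the controller reconciles them.

```yaml
# application.yaml -- tells Argo CD which Git path to reconcile into the cluster
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-configs  # hypothetical config repo
    targetRevision: main
    path: overlays/production  # e.g. the kustomize overlay from earlier
  destination:
    server: https://kubernetes.default.svc  # the cluster Argo CD runs in
    namespace: payments
  syncPolicy:
    automated:
      prune: true     # delete resources that disappear from Git
      selfHeal: true  # revert manual drift back to what Git declares
```

With automated sync turned on, nobody should be running kubectl apply by hand anymore, which is exactly the appeal.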
But as with everything else in this space, there are pitfalls. We’ve run into issues where our configuration files weren’t quite right, leading to unexpected behavior when we tried to apply them via Argo CD. It’s been a bit of a trial-and-error process, and I’m still not sure if it’s the silver bullet we’re looking for.
One thing that hasn’t changed is my love for eBPF. The tools in this space are getting more robust, and there’s ongoing interest in what eBPF can do to improve our monitoring and debugging capabilities. I’ve been playing with a few eBPF programs recently to see if they could help us get deeper insights into some of the performance bottlenecks we’re encountering.
As much as I’m excited about all these new tools, there’s also a sense of weariness. Kubernetes has become so complex that it feels like every time you solve one problem, ten more are waiting behind it. And while SRE roles are proliferating, the sheer volume of work can be overwhelming.
On top of all this, we’re still grappling with scaling infrastructure for a remote-first workforce thanks to COVID-19. We’ve had to adjust our ops practices to fit the new normal, and it’s not always smooth sailing. The network latency and reliability challenges are real, especially when you’re dealing with services that need to be highly available.
So here I am, fighting this battle every day. It’s a marathon, not a sprint. I’ve been thinking about taking some time off soon just to clear my head. Maybe I’ll even quit my job for a while and focus on SerenityOS full-time, like in one of those HN stories that made me chuckle.
But until then, I’m going to keep plugging away at this Kubernetes mess. Because in the end, it’s not about avoiding work; it’s about doing what you have to do with the best tools you can find and hoping that they make your job a little easier every day.
Until next time,
Brandon