$ cat post/the-monolith-ran-/-i-read-the-rfc-again-/-i-kept-the-bash-script.md

the monolith ran / I read the RFC again / I kept the bash script


Title: Kubernetes Complexity Fatigue and the SRE Paradox


September 30, 2019 was just a few days after I finally shipped that long-awaited feature for our platform engineering team. The feature? A robust and reliable GitOps setup using FluxCD to manage our Kubernetes clusters. We were all exhausted, but also relieved.

The Setup

Back in June, I had to argue hard against the “no new work” sentiment on my team to get this done. We were already running a large number of microservices on Kubernetes, and managing them had become a full-time job. The cluster state was diverging from our desired state more often than we liked, forcing manual rollbacks and redeploys that could take hours. Our developers were frustrated because their deployments were slow and unpredictable.

We decided to adopt FluxCD alongside our internal developer portal, Backstage, as part of a broader initiative to centralize our infrastructure management practices and reduce the risk of operational failures. With Git as the single source of truth for cluster state, we hoped to make our Kubernetes clusters more resilient and predictable: less like a ship in stormy seas and more like a well-oiled machine.
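For readers who haven’t set up Flux, the core of a setup like ours is just two manifests: one telling Flux which Git repository to watch, and one telling it which path in that repo to apply to the cluster. This is a minimal sketch in the style of the modern Flux (v2) API; the repository URL, names, and paths are invented for illustration, not our actual config.

```yaml
# Sketch only: hypothetical repo URL, names, and paths.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m            # how often Flux polls the repo for new commits
  url: https://github.com/example/platform-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m            # how often Flux re-checks the cluster against Git
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./clusters/production
  prune: true             # delete cluster objects removed from Git
```

The `prune: true` line is what makes drift correction cut both ways: resources deleted from Git get deleted from the cluster too, which is exactly the kind of power that demands careful review of every merge.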

The Debugging

However, as soon as we launched it, we hit an issue: FluxCD couldn’t reconcile some of our stateful services due to lingering secrets and misconfigured annotations. We spent days digging into logs and configuration files trying to understand why certain pods were failing to start or restart correctly. It turned out that one of the critical service mesh annotations had been accidentally left off, causing a cascading failure.
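To make the failure mode concrete: a single missing annotation on a pod template is invisible until the workload rolls. The sketch below is hypothetical (the workload name and image are invented, and Istio’s injection annotation stands in for whichever mesh annotation your stack actually uses), but it shows where the one-line omission lived.

```yaml
# Hypothetical StatefulSet; names, image, and the specific mesh
# annotation are illustrative, not our real config.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orders
spec:
  serviceName: orders
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
      annotations:
        # The line that was missing. Without it, sidecar injection
        # silently doesn't happen, and pods start without mesh
        # networking -- failing in ways that point everywhere but here.
        sidecar.istio.io/inject: "true"
    spec:
      containers:
        - name: orders
          image: example/orders:1.0
```

Because annotations live on the pod *template*, not the pod, nothing complains at apply time; the breakage only surfaces when the controller creates new pods, which is why it looked like a Flux reconciliation bug rather than a one-line config gap.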

Debugging Kubernetes issues can be a bit like chasing ghosts in an attic. You think you have everything figured out, but then something else pops up and derails your progress. This was no different. The team worked tirelessly to isolate the problem, but it wasn’t until I sat down for coffee with one of our SREs that we found a way through.

The SRE Paradox

Our SRE, let’s call him Mike, had been pushing us hard on DevOps practices and automation. He was also the one who kept reminding us about the “SRE Paradox”: the tools and automation we adopt to make our systems more reliable are themselves complex systems, with failure modes of their own.

But as we implemented complex tools like FluxCD, I couldn’t help but feel a twinge of that fatigue myself. The learning curve was steep, and there seemed to be an endless stream of new tools and practices to keep up with. We were constantly rewriting scripts, adjusting configurations, and refactoring our code to accommodate the new requirements.

Reflections

As we moved forward, I realized that the key wasn’t just about adopting new technologies but about finding a balance between automation and simplicity. FluxCD was great for managing state, but it introduced a level of complexity that required careful management. We needed to ensure that every time we added a new tool or practice, it brought us closer to our goal of reliable and predictable deployments.

The SRE Paradox is real, but it’s also an invitation to rethink how we approach infrastructure. Instead of just adding more automation for the sake of it, we should focus on making sure each step brings us closer to our desired state. It’s about finding the right balance between complexity and simplicity, between the tools we use and the problems they solve.

Looking Forward

As I sit here writing this post, I’m reflecting on how far we’ve come in just a few months. The challenges of managing Kubernetes clusters have only grown as more services are added and scaled. But with tools like FluxCD and a renewed focus on GitOps practices, we’re better equipped to handle the complexity.

The tech world is always moving forward, and it can be overwhelming at times. But I’m glad that I get to work in an environment where we can experiment, learn, and continuously improve. Whether it’s debugging Kubernetes issues or arguing for new tools, every day brings its own set of challenges—and that’s what keeps things interesting.

So here’s to hoping that the next few months bring more success and fewer headaches. Until then, I’ll keep my coffee close at hand and a sense of humor ready.