the deploy pipeline / I traced it to the library / the daemon still hums
Title: Kubernetes Complexity Fatigue Hits Me Hard
March 4, 2019. It’s been a long day in the world of platform engineering and site reliability. I’ve just stepped out of a meeting about how to tame the complexity that Kubernetes has brought into our infrastructure. The reality is that the honeymoon phase with Kubernetes feels over for many teams—my own included.
A Bit of Background
For those who aren’t in the know, by 2019 Kubernetes had matured into a core part of many organizations’ tech stacks. The learning curve wasn’t just about understanding how to use it, but about managing its complexity at scale. Our team was wrestling with issues like managing multiple namespaces, keeping up with an ever-growing number of custom resource definitions (CRDs), and ensuring robust observability and monitoring.
A Debugging Adventure
The other day, I spent a solid hour tracking down an issue that had been causing my app to fail intermittently. It wasn’t just one pod; it was multiple pods in different namespaces, all hitting the same problem at once. At first, I suspected a misconfiguration or a bug in our application code. But as I dug deeper, I realized it was something more fundamental.
It turned out to be an issue with how we were handling secrets and configurations across our namespaces. We had multiple ways of managing these secrets—some through Kubernetes Secrets, others via environment variables passed at runtime, and some even hardcoded in our code. The chaos was compounded by the fact that different environments (dev, staging, prod) often had different setup strategies.
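To make that mess concrete, here’s a rough sketch of the direction we wanted to consolidate toward: one Kubernetes Secret per namespace as the single source of credentials, pulled into the pod via `envFrom` instead of ad-hoc runtime env vars or hardcoded values. All the names here (`app-credentials`, `DB_PASSWORD`, the image tag) are invented for illustration, not our actual setup:

```yaml
# Hypothetical example: a single Secret per namespace replaces the mix of
# Kubernetes Secrets, runtime env vars, and hardcoded values.
apiVersion: v1
kind: Secret
metadata:
  name: app-credentials
  namespace: staging
type: Opaque
stringData:
  DB_PASSWORD: change-me   # real value injected by CI, never committed
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example/app:1.0
          envFrom:
            - secretRef:
                name: app-credentials   # all credentials flow through this one Secret
```

The point isn’t this exact manifest; it’s that dev, staging, and prod each get the same shape, so the only thing that varies per environment is the Secret’s contents.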
Enter Flux
That’s when I took a harder look at Flux. We were already using it for our GitOps workflow, but as we scaled, maintaining state and ensuring consistency across multiple namespaces became overwhelming. The idea behind Flux is simple: use Git as the source of truth for your Kubernetes configurations. In practice, at our scale, it was a nightmare.
We had to set up separate Git repos for each namespace, manage branch policies, and ensure that our team members understood how to apply changes without breaking anything. It wasn’t just about the tool; it was about changing our entire workflow.
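For context, this was the Flux v1 era, where a daemon in the cluster polls a Git repo and applies whatever it finds. Scoping it per namespace meant one daemon and one repo per namespace, roughly like the sketch below. The repo URL, paths, and image tag are placeholders; the flags follow Flux v1’s documented options, but treat this as an illustration rather than our exact config:

```yaml
# Hedged sketch of a Flux v1 daemon scoped to a single namespace.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flux
  namespace: staging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flux
  template:
    metadata:
      labels:
        app: flux
    spec:
      serviceAccountName: flux
      containers:
        - name: flux
          image: weaveworks/flux:1.11.0   # a 2019-era tag; pin whatever you actually run
          args:
            - --git-url=git@example.com:org/staging-config.git  # one repo per namespace
            - --git-branch=master
            - --git-path=releases          # only sync manifests under this path
            - --git-poll-interval=1m
```

Multiply that deployment (plus its repo, its deploy key, and its branch policies) across dev, staging, and prod, and you can see where the workflow overhead came from.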
SRE vs. DevOps
As I worked through this issue, my thoughts inevitably turned to the role of SREs versus traditional DevOps engineers. In the past, we had been focused on delivering fast and often with minimal oversight. But as complexity grew, it became clear that a more structured approach was needed. SRE principles—like those espoused by Google—were starting to make sense in our context.
SRE isn’t just about running operations; it’s about understanding the system at a deep level. It’s about ensuring that your infrastructure can handle unexpected loads and failures gracefully. For us, this meant rethinking how we manage secrets and configurations across namespaces. We needed something more than just tools like Flux—it required a cultural shift.
Reflections
As I wrap up my thoughts for today, I find myself reflecting on the journey. Kubernetes is a remarkable tool that has let us build enormously complex systems with relative ease. But it comes with its own set of challenges. The complexity fatigue is real, and it’s something we’re all grappling with.
For now, I’m leaning into the SRE mindset, trying to find a balance between automation and human oversight. Flux will be part of that equation, but not the only piece. It’s about creating systems that are resilient and maintainable in the long term.
Stay tuned as I continue to navigate these waters and share my learnings along the way.
This post is my attempt at capturing some of the struggles and thoughts around Kubernetes complexity during a pivotal time for platform engineering. Let’s hope for better days ahead!