$ cat post/kubernetes-complexity-fatigue:-a-real-world-perspective.md

Kubernetes Complexity Fatigue: A Real-World Perspective


September 9, 2019. I woke up to the usual emails and Slack messages, but something felt different this morning. The recent wave of articles about “Kubernetes complexity fatigue” had left me feeling a bit more introspective than usual.

Let’s be honest: back in late 2018, we were all excited about Kubernetes because it promised to simplify container orchestration. It seemed like every tech blog was talking about how easy managing containers would be with this magical tool. Fast forward a year, and now folks are starting to say that “Kubernetes is hard” and that “clusters are like Swiss cheese.” I’ve been through the highs and lows of managing Kubernetes clusters, so today I want to share some of my experiences.

The Beginning

In early 2019, we launched our internal platform engineering team. One of our primary goals was to make it easier for developers to deploy applications. We decided that Kubernetes would be a central part of this solution. For the first few months, everything seemed great. Our dev portal (Backstage) was getting positive feedback, and deployments were smooth.

The Honeymoon Ends

But as time went on, the challenges started piling up. We found ourselves spending more and more time debugging pod crashes, service disruptions, and network issues. Every new feature request meant another layer of complexity to manage in our clusters. And that’s when I realized—Kubernetes is not just about setting up a few pods.

We had to create custom resources for various services, write custom Kubernetes operators, and deal with the constant evolution of Kubernetes itself. We were learning as we went, but it wasn’t enough. Our team was growing, and so were our clusters. Managing more than 50 separate clusters became a nightmare.
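To give a flavor of what that work looked like, here is a minimal CustomResourceDefinition sketch in the style we were writing at the time (the `DeployTarget` kind and the `platform.example.internal` group are hypothetical names for illustration, not our actual resources; `apiextensions.k8s.io/v1beta1` was the current CRD API in 2019):

```yaml
# Hypothetical CRD sketch; all names are illustrative.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: deploytargets.platform.example.internal
spec:
  group: platform.example.internal
  versions:
    - name: v1alpha1
      served: true    # this version is available via the API server
      storage: true   # and is the version persisted in etcd
  scope: Namespaced
  names:
    plural: deploytargets
    singular: deploytarget
    kind: DeployTarget
```

Every CRD like this also needs an operator watching it, which is exactly where the maintenance burden crept in.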

The SRE Touchdown

Around this time, SRE (Site Reliability Engineering) roles were proliferating across tech companies. It made sense for us to bring on an SRE specialist to help with cluster management and stability. Our new team member, Alex, brought a fresh perspective and a wealth of experience from other Kubernetes-heavy organizations.

Together, we started refactoring our setup to make it more manageable. We implemented RBAC (Role-Based Access Control) properly, improved logging practices, and added monitoring for critical services. Slowly but surely, the stability of our clusters improved.
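“Implementing RBAC properly” mostly meant replacing broad cluster-wide grants with narrowly scoped roles. A minimal sketch of the pattern, assuming a hypothetical `staging` namespace and an `app-developers` group (both names are illustrative):

```yaml
# Illustrative Role/RoleBinding pair; namespace and group names are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: staging
  name: pod-reader
rules:
  - apiGroups: [""]               # "" means the core API group
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: staging
  name: read-pods
subjects:
  - kind: Group
    name: app-developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

The point is the scoping: developers can read pods and logs in their own namespace, and nothing else.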

The Boring Technology Behind a One-Person Internet Company

One day, while reading “The boring technology behind a one-person internet company,” I couldn’t help but think about how much our setup had become a mess. The article talked about how simple and straightforward infrastructure can be when it’s just for you. But for us, the complexity was undeniable.

Embracing eBPF

As we looked to simplify our systems further, one technology caught my attention—eBPF (extended Berkeley Packet Filter). It promised more efficient network monitoring and troubleshooting without the overhead of traditional tools. We started experimenting with it in some of our staging clusters and were pleasantly surprised by its performance.
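As a taste of the kind of visibility eBPF gives you, here is a one-line bpftrace program of the sort we experimented with (tool choice and probe are illustrative of the approach, not a prescription). It counts TCP retransmissions per process directly in the kernel, with no packet capture:

```
# Tally TCP retransmits by process name; Ctrl-C prints the counts.
# Requires root and a bpftrace-capable kernel.
bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits[comm] = count(); }'
```

Getting that kind of answer previously meant a tcpdump session and a lot of post-processing, which is exactly the overhead eBPF let us skip.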

Looking Ahead

Looking back on 2019 so far, I can see how much has changed. The buzz around serverless is still strong, but the reality is that for many companies, Kubernetes remains a critical piece of infrastructure. We’ve learned to tame the complexity with better tools and practices, but it’s clear that simplicity isn’t just a buzzword; it’s something you have to work for.

In my next post, I’ll talk more about how we’re using eBPF in our production environments. Until then, if you have any thoughts or experiences to share, feel free to hit me up on Twitter.


That’s the state of Kubernetes complexity fatigue for us in 2019. It’s been a rollercoaster ride, but I’m excited to see where we go next.