$ cat post/kubernetes-complexity:-a-lesson-in-over-engineering.md
Kubernetes Complexity: A Lesson in Over-Engineering
May 31, 2021. Kubernetes has been my go-to container-orchestration tool for years. I thought it was the pinnacle of orchestration tools, until one day it nearly bit me.
It started as a routine check on our cluster. We had a new application going into production and were ready to scale up. Our monitoring tool flagged an unusual spike in CPU usage. Normally, this would have been a straightforward investigation—look at the logs, spot the issue, fix it. But something felt different this time.
The first thing I did was run `kubectl top pods` to see which pod was hogging resources. It pointed me to one of our application pods that seemed to be doing nothing but consuming CPU. That’s when the hairs on the back of my neck started to stand up.
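For reference, the triage loop looked roughly like this. The namespace and pod names below are placeholders, not our real ones, and `kubectl top` assumes metrics-server is running in the cluster:

```shell
# Rank pods by CPU usage (requires metrics-server).
kubectl top pods -n production --sort-by=cpu

# Tail the suspicious pod's logs ("my-app-7d9f" is a placeholder name).
kubectl logs -n production my-app-7d9f --tail=200

# Check recent events, restarts, and resource limits on that pod.
kubectl describe pod -n production my-app-7d9f
```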
I dug into the logs, but found no obvious issues—no error messages or unusual activity. I decided to take a closer look at the application itself. After all, it was running in Kubernetes; surely there was some odd behavior caused by the environment, right?
That’s when I stumbled upon the `--kube-api-qps` and `--kube-api-burst` flags. They put a client-side cap on how quickly a component may call the Kubernetes API server: QPS is the sustained request rate it may maintain, and burst is how far a short spike may exceed that rate. Ours were set to very low values, 10 QPS with a burst of 20, which meant that even a slight increase in requests got the client throttled.
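The interaction between QPS and burst is easiest to see as a token bucket: the bucket holds up to `burst` tokens and refills at `qps` tokens per second. This is a minimal Python sketch of that semantics, not Kubernetes’ actual client-go rate limiter:

```python
import time

class TokenBucket:
    """Token-bucket limiter: sustained rate `qps`, spike capacity `burst`.
    A sketch of the --kube-api-qps / --kube-api-burst semantics only."""

    def __init__(self, qps: float, burst: int):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)       # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill tokens for the time elapsed since the last call,
        # capped at the burst capacity.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                     # this request would be throttled

bucket = TokenBucket(qps=10, burst=20)
# In a tight loop, roughly the first `burst` calls pass; the rest
# are throttled until the bucket refills at 10 tokens per second.
results = [bucket.allow() for _ in range(25)]
```

With limits this low, any component that makes more than a handful of API calls per second spends most of its time being told to wait.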
It turned out we had been over-engineering things. In our rush to make the application secure and resilient, we had inadvertently created a bottleneck: the application was being constantly throttled by its own misconfiguration, and all that queuing and retrying of throttled calls was, as far as I could tell, our mysterious CPU spike. Talk about irony!
I made a quick change, raising the QPS and burst limits, to see if it would fix the issue. Lo and behold, the CPU usage dropped immediately! Our cluster was breathing again.
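The fix itself is just configuration. On the command line these are the `--kube-api-qps` and `--kube-api-burst` flags; in a kubelet config file the same knobs are the `kubeAPIQPS` and `kubeAPIBurst` fields. A sketch of the kind of change involved, with illustrative values rather than the exact ones we shipped:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Raise the client-side API rate limits from the old 10/20 values.
kubeAPIQPS: 50
kubeAPIBurst: 100
```

Whatever values you pick, the point is to size them for the component’s actual request rate rather than an arbitrarily cautious default.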
This experience taught me an important lesson: complexity can arise not just because of poorly designed systems but also due to over-engineering in our approach to solving problems. In our zeal for robustness, we sometimes add layers upon layers of complexity that can hide simple issues.
It’s easy to fall into the trap of adding more and more Kubernetes resources—pods, services, ingresses—to make sure everything is just right. But when something goes wrong, it can be hard to trace back through all these components. This incident made me realize that simplicity and clarity should always be our first goals.
So now, as I type this, I’m looking at my application’s configuration with fresh eyes—trying not to add unnecessary layers without a clear understanding of what each one does. We’re moving towards simpler service meshes and fewer custom controllers in favor of out-of-the-box solutions.
In the world of platform engineering, where complexity fatigue is starting to set in, it’s crucial to keep things as simple as possible while still being robust. Otherwise, we risk making our lives harder than they need to be.