
Kubernetes Complexity: A Personal Dive into the Chaos


March 18, 2019. I remember it like it was yesterday. Kubernetes was a hot topic, but complexity fatigue was starting to set in. The promise of declarative configuration and self-healing systems was real, but the reality of managing clusters with multiple services felt overwhelming.

Today, I find myself looking back at that period with the benefit of hindsight, both here on the blog and as someone who’s been deeply immersed in this space for years now.

Last week, we faced a particularly gnarly issue. One of our microservices, running on Kubernetes, was flapping like crazy: CPU utilization spiked unpredictably, and the logs showed behavior that didn’t make sense. The service handled user sessions, which is no small task given the number of users it needed to support.

I dove into the cluster’s metrics first, hoping for some clues. Istio’s distributed tracing helped a bit, showing latency spikes on requests from specific regions. But that wasn’t enough; I needed to dig deeper.

The next step was to enable debug logging for the service, but that alone didn’t give me the full picture. The problem lay elsewhere, in how resources were being allocated and requests were being handled. It turned out a recent change to resource management in our cluster had left some pods under-allocated: starved of CPU, they slowed down under load, requests timed out, clients retried, and the retries added enough extra load to keep the service flapping.
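
The fix, once we finally saw it, was boring: give the pods requests and limits that reflect what they actually need, and let the scheduler do its job. A minimal sketch of what that looks like (the service name, image, and numbers here are illustrative, not our real config):

```yaml
# Illustrative only: a Deployment with explicit requests and limits, so the
# scheduler reserves enough CPU/memory and the kubelet knows where to cap it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: session-service            # hypothetical name for the service in question
spec:
  replicas: 3
  selector:
    matchLabels:
      app: session-service
  template:
    metadata:
      labels:
        app: session-service
    spec:
      containers:
        - name: session-service
          image: registry.example.com/session-service:1.4.2   # placeholder image
          resources:
            requests:
              cpu: "500m"          # reserved for the pod at scheduling time
              memory: "512Mi"
            limits:
              cpu: "1"             # CPU throttling kicks in above this
              memory: "1Gi"        # the container is OOM-killed above this
```

Requests set too low look perfectly fine on a quiet node and only start to hurt once the node fills up and throttling begins.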

This kind of issue is not uncommon in a large-scale Kubernetes deployment. Tools like kubectl and Helm, and newer ones like ArgoCD, are powerful but require careful handling. Every update or change can introduce subtle bugs that only show up under load.

I’ve spent countless hours debating best practices for managing our clusters. Should we stick with a monolithic approach built around Helm, or embrace GitOps with Flux? The trade-off between ease of deployment and operational flexibility is always on my mind. Today, I still lean towards GitOps because it provides better traceability, and rolling back a bad change is just a revert away.
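
For anyone who hasn’t gone down this road: the core of a Flux setup is just a couple of objects telling the cluster which Git repository and path to reconcile. A minimal sketch using the current Flux APIs, with the repo URL, paths, and intervals as placeholders:

```yaml
# Sketch of a GitOps setup with Flux: the cluster watches a Git repo and
# continuously reconciles the manifests under a given path.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m                     # how often to poll the repository
  url: https://example.com/our-org/platform-config   # placeholder repo URL
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: session-service
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./apps/session-service     # placeholder path inside the repo
  prune: true                      # remove cluster objects deleted from Git
  timeout: 2m
```

Everything the cluster runs is described in that repository, so “who changed what, and when” is answered by git log, and a rollback is a revert plus a reconcile rather than a late-night kubectl session.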

The recent popularity of eBPF is fascinating. It’s like adding superpowers to our DevOps tooling. By loading small sandboxed programs into the kernel from user space, we can do things like optimize network performance or debug issues without ever restarting a service. But it’s not a silver bullet; it requires an understanding of both the application and the underlying system.

As I reflect on this period, it’s clear that while Kubernetes has simplified many aspects of cloud-native deployments, it also adds layers of complexity. Debugging an issue often means tracing through multiple layers: networking, application code, the container runtime, and the OS itself. Tools like Prometheus, Grafana, and the Kubernetes Dashboard are lifesavers, but they’re only as good as the data you feed them.
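
To make that concrete, here is the kind of rule I mean by feeding them the right data: an alert on tail latency rather than a dashboard someone has to remember to check. A sketch, assuming the Prometheus Operator’s PrometheusRule CRD and Istio’s standard request-duration histogram; the metric and label names are assumptions on my part:

```yaml
# Sketch: warn when p99 latency for the session service stays above one
# second for five minutes. Metric and label names assume Istio's standard
# request-duration histogram and are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: session-service-latency
  namespace: monitoring              # placeholder namespace
spec:
  groups:
    - name: session-service.rules
      rules:
        - alert: SessionServiceHighLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{destination_workload="session-service"}[5m])) by (le)
            ) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "p99 latency for session-service above 1s for 5 minutes"
```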

The tech community is buzzing with new tools and services, from Firefox’s new file-transfer service, Send, to Spotify’s push for fairer practices in music streaming. But our focus remains on building robust infrastructure that can handle growing demands without breaking down.

In today’s world of remote-first teams and ever-growing infrastructure, it’s more critical than ever that our systems are reliable and maintainable. The tools we use should help us deliver value quickly while maintaining the quality users expect.

As I write this, Kubernetes continues to evolve, and with each version, the complexity changes shape. But one thing remains constant: the journey of debugging, learning, and improving never ends.


This reflection isn’t just about a specific incident; it’s a reminder that our work as engineers is an ongoing process. Whether we’re dealing with Kubernetes or any other technology, the challenge lies in understanding the systems deeply enough to tame their complexities.