$ cat post/the-config-was-wrong-/-the-database-was-the-truth-/-i-saved-the-core-dump.md

the config was wrong / the database was the truth / I saved the core dump


Title: Kubernetes Complexity Fatigue: My Journey Through a Cluster’s Woes


December 9, 2019. The date sits in my calendar like an old friend, a day when the weight of our growing cluster felt particularly heavy. It was one of those days when I had to dig deep and get real with Kubernetes’ quirks.

We had been running our services on Kubernetes for a couple of years, but as we scaled out, the complexity just kept piling up. Pods were restarting, deployments were failing, and the logs were a never-ending sea of noise. The team was feeling the strain; frustration simmered under the surface like a pot about to boil over.

One particular day, I found myself staring at a series of pod crashes in my dashboard. The logs suggested it might be an issue with our custom application code, but there was something off. The timestamps and messages seemed too… clean? A bit too sanitized for actual production issues. This wasn’t the first time we had seen these types of false positives, but this time, I decided to dive deeper.

I started digging into the network stack, which led me down a rabbit hole of eBPF (extended Berkeley Packet Filter). For those who might not know, eBPF is becoming quite popular for deep packet inspection and tracing in containerized environments. It’s like having a superpower in your debugging toolkit, but it can also be a bit overwhelming.

As I spelunked through kernel tracepoints with bpftool and cilium, I realized there was an underlying issue with our network configuration. A misbehaving CNI (Container Network Interface) plugin was causing packets to be dropped, which explained those suspiciously clean crash messages in the application logs. It turned out our network setup wasn’t as rock-solid as we had hoped.
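If you have never played with eBPF, here is a rough idea of what that kind of spelunking looks like. This is a minimal sketch using the BCC Python bindings rather than the exact bpftool and cilium incantations I ran that day; it assumes BCC and the kernel headers are installed on the node and that you run it as root. All it does is attach to the skb:kfree_skb tracepoint, which fires whenever the kernel throws a packet away, and print where in the kernel the drop happened.

```python
# Minimal sketch, not our production tooling: trace kernel packet drops
# with the BCC Python bindings. Requires root, BCC, and kernel headers.
from bcc import BPF

# The skb:kfree_skb tracepoint fires whenever the kernel frees a socket
# buffer, i.e. drops a packet. We print the kernel address it was dropped
# from; /proc/kallsyms can turn that address into a function name.
program = r"""
TRACEPOINT_PROBE(skb, kfree_skb) {
    bpf_trace_printk("packet dropped at location %lx\n",
                     (unsigned long)args->location);
    return 0;
}
"""

b = BPF(text=program)
print("Tracing packet drops (skb:kfree_skb)... hit Ctrl-C to stop.")
b.trace_print()
```

Seeing a steady stream of drops while the application logs stayed eerily tidy is the kind of signal that shifts suspicion from your code to the network layer.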

The fix involved updating the CNI plugin and adding some extra logging so we could catch issues like this sooner. It’s a small win, but one that felt like a breath of fresh air after days of debugging.
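On the "catch it sooner" front, one of the simpler things that helps is streaming Kubernetes warning events somewhere you actually look. Here is a bare-bones sketch using the official kubernetes Python client; it is illustrative only, and the kubeconfig details and the print-to-stdout bit are assumptions, not a description of what we actually deployed.

```python
# Bare-bones sketch: stream cluster Warning events so probe failures,
# image pull errors, and restarts surface before anyone opens a dashboard.
# Assumes the official `kubernetes` Python client and a working kubeconfig.
from kubernetes import client, config, watch

config.load_kube_config()  # inside a pod: config.load_incluster_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for item in w.stream(v1.list_event_for_all_namespaces):
    ev = item["object"]
    if ev.type != "Warning":
        continue
    obj = ev.involved_object
    # In real life this would feed a log pipeline or alerting, not stdout.
    print(f"{ev.last_timestamp} {obj.namespace}/{obj.name}: "
          f"{ev.reason} - {ev.message}")
```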

Meanwhile, I was also keeping an eye on the ArgoCD and Flux GitOps tools. These were maturing nicely, providing more robust ways to manage our clusters’ state and configurations. However, they came with their own set of challenges—like dealing with Kubernetes RBAC (Role-Based Access Control) issues and ensuring that everyone in the team understood how to use them effectively.
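The RBAC side is the part that bites most often in my experience: a sync fails and it takes a while to realize the controller’s service account simply isn’t allowed to touch the resource. A quick sanity check is to ask the API server directly with a SubjectAccessReview. The sketch below uses the kubernetes Python client; the namespace and service account names are made up for illustration, not our actual setup.

```python
# Sketch: ask the API server whether a GitOps controller's service account
# may create Deployments in a target namespace. Names here are hypothetical.
from kubernetes import client, config

config.load_kube_config()
authz = client.AuthorizationV1Api()

review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="system:serviceaccount:argocd:argocd-application-controller",
        resource_attributes=client.V1ResourceAttributes(
            namespace="payments",   # hypothetical target namespace
            verb="create",
            group="apps",
            resource="deployments",
        ),
    )
)

result = authz.create_subject_access_review(review)
status = result.status
print("allowed" if status.allowed else f"denied: {status.reason or 'no reason given'}")
```

Running a handful of these checks before a rollout is a lot cheaper than decoding a failed sync at 11 p.m.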

One evening, while working late, I was going through a GitOps deployment when I stumbled upon an interesting SRE (Site Reliability Engineering) discussion. The concept of “SRE” had been around for years, but it felt like it was really starting to take shape in my organization. There was talk about formalizing platform engineering and internal developer portals—Backstage being one example that seemed promising.

But as I sat there late at night, thinking about all the moving parts, I couldn’t help but feel a bit overwhelmed. The complexity of our Kubernetes cluster was just too much for a single person to handle. It felt like every time we solved one problem, three more popped up in its place.

This led me to think about some advice from a Hacker News article that resonated with me: “Learning at work is work, and we must make space for it.” As the new year approached, I resolved to carve out time to explore some of these tools further. Maybe try out eBPF in more depth, or dive into SRE best practices.

In a way, this was just the beginning of a long journey. Kubernetes will continue to evolve, and with it, our approach to managing infrastructure will need to adapt as well. But for now, I’ll focus on making small improvements, one pod at a time.


That’s my day in review, folks. Complexity fatigue is real, but so are the opportunities to learn and grow. Here’s to another year of challenges and victories in the world of platform engineering!