$ cat post/kubernetes-complexity-fatigue-meets-remote-work.md

Kubernetes Complexity Fatigue Meets Remote Work


April 6, 2020, was just a few days before my company shifted entirely to remote work. I remember the day vividly because it marked an inflection point in both technology and my personal life.

As an engineering manager with a background in platform engineering, I had watched Kubernetes become a major part of our infrastructure over the previous couple of years. But by late 2019, there was a growing sense that its complexity was starting to bite us: we were juggling multiple clusters, wrestling with stateful workloads, and fighting to keep everything stable. Every day seemed to bring a new challenge.

One particular incident stands out. We had just finished deploying our application across two separate Kubernetes clusters for redundancy. Everything looked good on the surface, but we quickly ran into an issue where a service was intermittently failing to start up properly in one of the clusters. The logs were cryptic and the underlying cause wasn’t immediately obvious.

After days of digging through documentation, trying out different troubleshooting techniques, and coordinating with our operations team, we finally pinpointed the problem: a misconfiguration in one of the cluster’s network policies. It was frustrating—what should have been a straightforward deployment ended up being an exercise in patience and persistence.
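The failure mode is a classic one with network policies: once a namespace has a default-deny posture, an allow rule whose label selector doesn't exactly match the pod's labels silently drops traffic, which looks like an intermittent startup failure. A minimal sketch of the shape of such a policy (all names and labels here are hypothetical, not our actual manifests):

```yaml
# Hypothetical illustration of the failure mode: if the matchLabels below
# drift from what the Deployment actually sets on its pods (e.g. the pods
# carry "app: payments-svc" instead of "app: payments"), ingress to the pod
# is silently denied and the service appears to fail unpredictably.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payments-ingress   # hypothetical name
  namespace: payments            # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: payments              # must match the pod's labels exactly
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend      # only pods with this label may connect
      ports:
        - protocol: TCP
          port: 8080
```

Nothing in the cluster warns you when a selector matches zero pods, which is why the logs stayed cryptic for so long.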

But that wasn’t the only complexity we faced. With more developers working from home as the pandemic took hold, we needed to ensure our infrastructure could scale to remote use. That meant setting up secure remote access, making our internal developer portal (Backstage) reachable from outside the office, and making sure all our tools worked acceptably over slow or unreliable connections.

On this day in April, as the world seemed to be closing down around us, I found myself wrestling with a different kind of complexity: how to stay productive while working from home. The tools we had for managing Kubernetes clusters were fine in an office environment, but they struggled with the new remote-first reality. We started looking more closely at GitOps tools like Argo CD and Flux, hoping they could streamline our deployment process and reduce the friction caused by manual steps.
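The appeal of the GitOps model is that the cluster pulls its desired state from Git rather than engineers pushing changes by hand. A hedged sketch of what that looks like as an Argo CD Application (repo URL, names, and paths are all hypothetical):

```yaml
# Hypothetical sketch: Argo CD watches the Git repo below and keeps the
# cluster in sync with it, so a merged pull request replaces a manual
# kubectl apply from someone's laptop over a flaky home connection.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments                # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/infra/manifests.git   # hypothetical repo
    targetRevision: main
    path: payments/overlays/production                 # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift in the cluster
```

For a distributed team, the side benefit is that Git history becomes the deployment audit log, which suits an asynchronous workflow.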

It was a time of both challenge and opportunity. While we were dealing with the immediate issues of remote work and Kubernetes complexity, there was also a chance to rethink our approach. We started exploring eBPF as a way to gain deeper insights into our system performance without relying heavily on traditional monitoring tools. The idea of using eBPF intrigued me because it offered a new level of visibility that we hadn’t had before.
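For a flavor of what that visibility looks like in practice, here is the kind of bpftrace one-liner that first caught my attention (purely illustrative; it assumes bpftrace is installed and is run as root, and was not something we had in production at the time):

```
# Count read() syscalls per process, system-wide, with no instrumentation
# added to the applications themselves; Ctrl-C prints the per-process tally.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_read { @reads[comm] = count(); }'
```

Attaching to kernel tracepoints like this is what makes eBPF attractive next to traditional agents: the applications need no changes at all.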

In the end, what mattered most was adapting to the changing landscape. We learned to embrace a more asynchronous workflow and found ways to keep everyone engaged despite being physically apart. It wasn’t always easy, but looking back, those early days laid the groundwork for our team’s resilience in the face of challenges.

As I look at my calendar today, I see a reminder from that time: “Learn eBPF.” Maybe one day, that will finally happen. For now, though, it’s about staying agile and ready to tackle whatever comes next—whether it’s Kubernetes complexity or remote work obstacles.