$ cat post/kubernetes-complexity-fatigue-and-the-art-of-keeping-it-simple.md
Kubernetes Complexity Fatigue and the Art of Keeping It Simple
October 5, 2020 was a day I could easily have forgotten amid the constant rush to keep our platform running smoothly. It had been a few weeks since we rolled out ArgoCD as part of our CI/CD pipeline, and while it promised plenty of automation and simplicity, the reality of managing Kubernetes complexity was starting to set in.
The day began with a flurry of alerts from our monitoring systems. A couple of our services were showing degraded performance, and I found myself diving into logs to figure out what had gone wrong. As I scrolled through the verbose output, I couldn’t help but feel that Kubernetes, while incredibly powerful, was also one of the most complex tools we had ever worked with.
I sat down at my desk, trying to understand why our app was running slower than usual. It wasn’t just a matter of fixing a single line of code or tweaking a configuration; this required digging through YAML files and understanding how our services were orchestrated across multiple nodes. The ArgoCD deployment had introduced a new layer of complexity that I hadn’t fully appreciated.
I opened up the Backstage portal, hoping to find some clues about recent changes in our infrastructure. It’s amazing what an internal developer portal can do for visibility—back when we first set it up, we thought it would be useful for onboarding and documentation, but now it serves as a centralized hub for monitoring, service management, and more.
As I navigated through the portal, I noticed a few issues flagged by the SRE team. They were tracking down some flakiness in our database connections, which seemed to coincide with the recent Kubernetes deployment. It was a reminder that even when things look smooth from the outside, the underlying infrastructure can still cause headaches.
The day dragged on as I worked through these issues, but it wasn’t just about debugging. An underlying sense of complexity fatigue was creeping in. The number of moving parts in our system had grown significantly, and while tools like ArgoCD were supposed to make life easier, they also introduced new challenges.
Around lunchtime, the team joined me for a quick stand-up. We talked through the issues we were facing and shared some ideas about how to address them. One idea was to start breaking down our services into smaller, more manageable components. Another suggestion was to revisit our logging strategy—maybe make it less verbose so that when an issue does arise, it’s easier to pinpoint the cause.
After lunch, I spent a bit of time tweaking our logging configuration: trimming some of the noisiest messages and adding context so we could better understand what was happening in each service at any given moment. It wasn’t glamorous work, but sometimes the simplest changes make the biggest difference.
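To make that concrete, here’s a minimal sketch of the kind of change I mean, assuming a Python service using the standard library’s `logging` module. The field names (`service`, `release`) and values are hypothetical, not what we actually shipped; the idea is just that every line carries enough context to tell you where it came from:

```python
import logging

# A filter that stamps static context fields onto every log record,
# so each line identifies which service and release emitted it.
# The field names here are illustrative placeholders.
class ContextFilter(logging.Filter):
    def __init__(self, **context):
        super().__init__()
        self.context = context

    def filter(self, record):
        # Attach each context field as an attribute on the record
        # so the Formatter below can reference it.
        for key, value in self.context.items():
            setattr(record, key, value)
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [%(service)s@%(release)s] %(message)s"
))
handler.addFilter(ContextFilter(service="checkout", release="v1.4.2"))

logger = logging.getLogger("app")
logger.addHandler(handler)
# Keep the default level at INFO to cut noise; drop to DEBUG only when digging.
logger.setLevel(logging.INFO)

logger.info("db connection pool warmed up")
```

The nice part of doing this with a filter rather than sprinkling context into each call site is that the extra fields ride along on every record automatically, so the log volume stays the same while each line becomes more self-explanatory.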
In the evening, as I sat back and took stock of the day, it struck me that we were at a crossroads. We could continue down the path of ever-increasing complexity by adding more layers of automation and orchestration, or we could focus on simplifying our architecture to reduce cognitive load for everyone involved. The latter seemed like a better approach.
As I typed up my notes for the next day’s stand-up, I realized that this wasn’t just about solving immediate problems; it was also about setting a path forward. We needed to embrace simplicity and elegance in our designs, even if it meant forgoing some of the latest and greatest technologies.
That night, as I reflected on the day, I couldn’t help but think about how far we’ve come and how much more we can achieve by keeping things simple. The tech world might be all about complexity, but sometimes the best solutions are the simplest ones.