Kubernetes Complexity Fatigue and the Siren Song of SRE

It’s November 4, 2019, and I find myself reflecting on a few months that have been both exhilarating and exhausting. Kubernetes has been my workhorse for years, but lately it feels more like an albatross. Each new release seems to add another layer of complication that I’m not convinced we need.

Just last week, I found myself arguing with a colleague about the best way to manage our services in Kubernetes. The debate came down to how much boilerplate we should write by hand versus how much we should lean on higher-level tools and operators. There’s a siren song in the industry: every new tool or framework promises to simplify things, and it’s hard not to get caught up in it.

On one hand, you have Argo CD and Flux, two GitOps tools that have been gaining traction. They promise a simpler way to manage your Kubernetes clusters: the desired state lives in version-controlled configuration files, and the tool continuously syncs the cluster to match it. On the other hand, there’s always that whisper of “why do I need yet another tool?”
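For anyone who hasn’t looked under the hood, the core idea is less magical than it sounds: compare what Git says should exist with what the cluster actually has, and report the difference. Here is a minimal sketch of that comparison in plain Python. It is not how Argo CD or Flux implement it; the `Drift` type, the `diff_state` function, and the service names and image tags are all made up for illustration, and real tools compare full Kubernetes manifests rather than small dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Drift:
    """One difference between desired and live state (hypothetical type)."""
    name: str
    kind: str  # "missing", "extra", or "changed"


def diff_state(desired: Dict[str, dict], live: Dict[str, dict]) -> List[Drift]:
    """Compare desired state (from Git) with live state (from the cluster)."""
    drifts = []
    for name in sorted(desired.keys() | live.keys()):
        if name not in live:
            drifts.append(Drift(name, "missing"))   # in Git, not in cluster
        elif name not in desired:
            drifts.append(Drift(name, "extra"))     # in cluster, not in Git
        elif desired[name] != live[name]:
            drifts.append(Drift(name, "changed"))   # both exist but differ
    return drifts


# Illustrative data only: names and versions are invented.
desired = {
    "deploy/api": {"image": "api:1.4.2", "replicas": 3},
    "deploy/web": {"image": "web:2.0.0", "replicas": 2},
}
live = {
    "deploy/api": {"image": "api:1.4.1", "replicas": 3},
    "deploy/worker": {"image": "worker:0.9.0", "replicas": 1},
}

for drift in diff_state(desired, live):
    print(f"{drift.kind:8} {drift.name}")
```

An empty result means the cluster matches Git. Anything else is drift that a sync would try to correct, and running the same check against each environment is one way to see where they have diverged.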

The truth is, every additional piece in our tech stack adds complexity and another potential point of failure. We’ve been using Argo CD for a while now, but it’s not without its quirks. Recently, we had a sync issue where state wasn’t being propagated correctly, leading to inconsistent deployments across environments. It took quite some time to track down, especially since the logs gave little insight into what was happening.

As if this weren’t enough, the SRE role is becoming more prevalent in our organization. There’s a growing recognition that ops and dev need to work closely together to ensure reliability. I’ve found myself spending more time on monitoring and logging than I would have liked. This isn’t necessarily a bad thing; it does make us aware of potential issues before they become crises. But it also means I’m constantly juggling development, operations, and infrastructure responsibilities.

Then there’s the matter of scaling remote-first infrastructure due to COVID-19. We’ve had to rapidly adapt our tools and processes to support a fully distributed team. This has forced us to re-evaluate how we collaborate, share knowledge, and ensure consistency across different environments. Slack, for example, became the new battleground for discussions that are sometimes productive but often bogged down by endless meetings and notifications.

Speaking of which, I’ve been giving some thought to our internal developer portal, Backstage. It’s an exciting project that aims to provide a single source of truth for all our services and applications. However, as we dig deeper into the tooling, it feels like we’re adding another layer of abstraction on top of everything else. The promise of automation and self-service is tempting, but so far, the reality has been more complex than expected.

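And then there’s eBPF, a technology that’s starting to gain traction. It promises to transform how we monitor and optimize our systems by running small, sandboxed programs inside the Linux kernel, attached to events such as system calls and network packets, without changing kernel source or loading kernel modules. But for teams like ours it still feels early, and I’m not sure it’s ready for prime time just yet.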

In the end, what really matters is finding the right balance between simplicity and functionality. We need tools that make our lives easier without adding unnecessary overhead. It’s a fine line, and one we have to walk carefully.

So here’s to another month in tech—full of challenges, lessons learned, and endless debates about which tool is best. Let’s keep pushing the boundaries while remembering why we started: to build great software that people can use every day.

Until next time, Brandon