$ cat post/nmap-on-the-lan-/-the-database-was-the-truth-/-it-was-in-the-logs.md

nmap on the lan / the database was the truth / it was in the logs


Kubernetes Complexity Fatigue: When DevOps Meets Real-World Problems


May 4, 2020 was another ordinary Monday in the world of tech, but to me, it felt like a landmark day. I woke up to the usual alarm and settled in for another day of DevOps, platform engineering, and Kubernetes (K8s) complexity fatigue.

Today, I want to talk about something that’s been gnawing at me for months—Kubernetes complexity. It’s not just that K8s is hard; it’s more that as a platform engineer, I’m constantly trying to balance the needs of developers with the realities of infrastructure. The recent rise in SRE roles and the proliferation of internal developer portals like Backstage only add layers to this challenge.

The Complexity Conundrum

Kubernetes has become the de facto standard for container orchestration, but it's a double-edged sword. While it provides incredible flexibility and scalability, it also introduces an overwhelming amount of complexity. Just think about everything that can go wrong: misconfigured resources, networking issues, storage failures, and security gaps.
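To make "misconfigured resources" concrete, here's the kind of thing I mean. This is a hypothetical manifest, not one from our cluster: a container whose memory limit sits below its actual working set, so the kubelet OOM-kills it the moment traffic picks up.

```yaml
# Hypothetical Deployment -- app name, image, and numbers are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:1.0
          resources:
            requests:
              memory: "128Mi"   # what the scheduler reserves on a node
              cpu: "100m"
            limits:
              memory: "128Mi"   # if the app's working set exceeds this,
                                # the kubelet OOM-kills the container
              cpu: "500m"
```

The insidious part is that a manifest like this is perfectly valid; nothing complains until the workload is under real load.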

In my latest project, we were working on a large-scale deployment where developers needed robust tools for managing their applications. We built out a comprehensive Backstage portal to give them visibility into their services and the underlying infrastructure. However, as the system grew, so did the complexity of managing it all—from deploying new applications to scaling resources during peak loads.
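For anyone who hasn't used Backstage: registering a service in the catalog comes down to a small YAML descriptor checked into the service's repo. Here's a minimal sketch (the service and team names are made up, not from our actual portal):

```yaml
# catalog-info.yaml -- hypothetical Backstage catalog entry
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api            # hypothetical service name
  description: Handles payment processing
  annotations:
    # used by the Backstage Kubernetes plugin to find this service's workloads
    backstage.io/kubernetes-id: payments-api
spec:
  type: service
  lifecycle: production
  owner: team-payments          # hypothetical owning team
```

Each descriptor is tiny, but multiply it across hundreds of services and keeping the catalog accurate becomes its own operational job.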

A Day in the Life

One day, I found myself deep in a support ticket about an unexpected outage. Developers were seeing errors related to a custom resource definition (CRD) they had created. It was a tricky issue: we needed to understand not just what the CRD did but also why it wasn't functioning as expected.

After a few hours of tracing logs and debugging, I realized there was an underlying configuration issue with how the CRD interacted with other components in our K8s cluster. It’s moments like these that remind me why Kubernetes complexity can be such a challenge. You’re not just dealing with the application itself but the entire infrastructure stack.
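For context, a CRD is itself just a Kubernetes object that teaches the API server about a new resource type. Here's a minimal sketch of one (a hypothetical "Widget" resource, not the actual CRD from our incident); subtle mistakes in the schema or the version flags are exactly where these interaction bugs tend to hide.

```yaml
# Minimal CRD sketch -- hypothetical resource, illustrative only.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com      # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
    - name: v1
      served: true               # the API server serves this version
      storage: true              # exactly one version is persisted in etcd
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
```

Every controller, admission webhook, and RBAC rule that touches this type has to agree with what's declared here, which is why a small mismatch can ripple across the cluster.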

Learning from Failure

The experience taught me a valuable lesson about documentation and communication. We need to do better at documenting complex configurations and ensuring that everyone who uses our systems understands what they are doing. I started advocating for more standardized templates and clearer guidelines in our developer portal.
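One concrete form this is taking: Backstage's scaffolder lets you encode a golden path as a template, so new services start from vetted defaults instead of copy-paste. A sketch of what such a template can look like, using the scaffolder's template format (the names and the skeleton path are hypothetical):

```yaml
# Hypothetical Backstage scaffolder template -- names are illustrative.
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: standard-service
  title: Standard Service
  description: New service with our documented K8s defaults baked in
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required:
        - name
      properties:
        name:
          type: string
          description: Unique name for the new service
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton          # repo-relative path to the boilerplate
        values:
          name: ${{ parameters.name }}
```

The point isn't the template itself; it's that the defaults we document become the defaults people actually get.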

The Remote-First Shift

As the world shifted to remote work during the pandemic, a well-designed platform became even more critical. We had to adapt our infrastructure quickly to support this new reality. GitOps tools like ArgoCD and Flux helped us manage deployments efficiently from any location, but they come with their own challenges: they can be brittle if not configured carefully.
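To illustrate the GitOps model, this is roughly what an ArgoCD Application looks like: you point it at a Git repo and path, and the controller keeps the cluster reconciled to whatever is committed there. The repo URL and names here are hypothetical:

```yaml
# Hypothetical ArgoCD Application -- repo URL, paths, and names are made up.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-configs
    targetRevision: main
    path: apps/payments-api
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete cluster resources that were removed from Git
      selfHeal: true   # revert manual drift back to the committed state
```

The brittleness I mentioned usually lives in flags like `prune` and `selfHeal`: turned on blindly, they'll happily undo a hotfix someone applied by hand during an incident.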

Moving Forward

In the coming weeks, I plan to focus on improving our internal documentation around K8s best practices. We'll also explore using eBPF (extended Berkeley Packet Filter) in more areas of our infrastructure; it's clearly gaining traction, and it could give us powerful new tools for monitoring and troubleshooting.

Reflecting on the Era

Looking back at May 2020, it's clear we were navigating a complex landscape. The tech industry was buzzing: developer portals like Backstage were taking off, SRE roles were proliferating, and Kubernetes complexity fatigue was setting in. As I continue to wrestle with these challenges, I'm reminded of the importance of staying grounded and learning from each problem we encounter.

So here’s to another day in the world of platform engineering—complexity and all. Keep your wits about you, stay curious, and always seek improvement.

