$ cat post/the-prod-deploy-froze-/-i-diff-the-past-against-now-/-it-was-in-the-logs.md

the prod deploy froze / I diff the past against now / it was in the logs


Title: Kubernetes Complexity Fatigue: A Year of Learning


January 4, 2021. Another Monday morning starts with a cup of coffee and the usual barrage of emails. I open Slack and see a flood of messages about our latest k8s issue. More and more, half my day goes to triaging and fixing k8s problems, which is frustrating given how much time we’ve invested in automating and streamlining this setup.

This past year has been a rollercoaster of Kubernetes complexity fatigue. We started 2020 with big plans for our platform engineering team: rolling out new features, improving security, and making our infrastructure more robust. But somewhere along the line, k8s itself started getting in the way.

The Beginning

Back in March, we launched a fresh Backstage instance. It was supposed to be a one-stop shop for internal developers, providing all the tools they needed. But with each new integration, the complexity of our Kubernetes setup grew. We spent weeks setting up Istio, Knative, and a whole host of other services, only to find ourselves spending more time debugging than coding.

The Crisis

One particular incident stands out. Around June 15th, our production environment started showing strange behavior. Pods were crashing left and right, and our monitoring tools couldn’t pinpoint the root cause. We spent days chasing down potential issues—resource limits, network bottlenecks, you name it. Finally, after countless meetings and discussions, we realized that one of our custom controllers had been misbehaving due to a subtle bug.

It was humiliating. Here we were, a team that prided itself on knowing k8s inside out, and we still couldn’t get this basic setup right. The whole incident highlighted the growing pains we were experiencing with Kubernetes.

Learning from Failure

The failure wasn’t just technical; it was cultural too. We realized that our approach to k8s had become overly complex. Every new service or tool added another layer, making the entire system harder to maintain. So, in the closing months of 2020, we started a project: KubeSimplicity.

KubeSimplicity focused on simplifying our Kubernetes setup by removing unnecessary components and standardizing practices. We started with a clean slate, re-evaluating every single service and tool. It was painful at first—getting rid of Knative or Istio required significant changes in how we built our applications. But as we made progress, the benefits became clear.
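To give a sense of what that looked like in practice, here is a minimal sketch of the kind of change involved in moving a workload off Knative Serving and onto core primitives: a plain Deployment plus Service in place of a Knative Service. The names, image, and numbers are hypothetical, not our actual manifests.

```yaml
# Hypothetical "orders-api" workload, rewritten from a Knative Service into
# core primitives: a Deployment with a fixed replica count and an ordinary
# ClusterIP Service in front of it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 2                        # fixed scaling instead of scale-to-zero
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: orders-api
          image: registry.example.com/orders-api:1.4.2   # placeholder image
          ports:
            - containerPort: 8080
          resources:                 # explicit requests/limits, set per service
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: orders-api
spec:
  selector:
    app: orders-api
  ports:
    - port: 80
      targetPort: 8080
```

The trade-off, of course, is giving up scale-to-zero and revision-based routing in exchange for fewer moving parts.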

The Shift

By the end of the year, we had significantly reduced the complexity of our k8s setup. We switched to a simpler service mesh and focused on core Kubernetes features. This shift not only improved stability but also freed up time for more meaningful work, like building new features or enhancing security practices.
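As one illustration of what "core Kubernetes features" can cover, here is a hedged sketch of a plain NetworkPolicy doing the basic traffic-restriction work that often gets pushed into mesh-level policy. The app labels are hypothetical and reuse the orders-api example above.

```yaml
# Hypothetical policy: only pods labelled app=storefront may reach the
# orders-api pods, and only on TCP 8080. Enforcement requires a CNI plugin
# that supports NetworkPolicy (Calico, Cilium, and similar).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api-ingress
spec:
  podSelector:
    matchLabels:
      app: orders-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: storefront
      ports:
        - protocol: TCP
          port: 8080
```

It doesn't give you mTLS or fine-grained L7 rules, but for simple "only these pods may talk to this service" requirements it needs nothing beyond what Kubernetes ships with.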

Reflecting on 2020

Looking back at 2020, I see that we faced the same challenge many others did: managing the ever-growing complexity of modern infrastructure. Kubernetes is a powerful tool, but it requires constant vigilance and careful management to avoid falling into the trap of over-engineering.

This year, our goal is to keep things simple while still leveraging k8s’s full potential. We’re focusing on automation, standardization, and continuous improvement, challenges I’m sure many others in tech are grappling with as well.

So here’s to 2021—the year we tackle Kubernetes complexity fatigue head-on and find a balance between power and simplicity.


That was my journal entry for January 4, 2021. It encapsulates the challenges and lessons learned from managing our k8s setup during that tumultuous period.