$ cat post/bash-script-from-ninety-/-the-firewall-rule-was-too-strict-/-the-repo-holds-it-all.md

bash script from ninety / the firewall rule was too strict / the repo holds it all


Title: Kubernetes Complexity Fatigue Hits Home


December 16, 2019. The day I realized that the grand vision of Kubernetes as a silver bullet for all our orchestration needs was, well, not as shiny in real life.

It’s been over two years since we made the big switch from Mesos to Kubernetes across our entire fleet. I remember the excitement: Kubernetes seemed like it would handle everything from stateful services to deployments with ease. But now, two years later, the complexity is starting to rear its ugly head.

The Setup

We were running a mix of applications: some stateless microservices, stateful databases, and even a few monoliths that we had started refactoring into microservices. Kubernetes was supposed to make this seamless, but instead, it’s become a bit of a Rube Goldberg machine.

The Problem: StatefulSets and Persistent Volumes

One of the biggest issues is with our stateful applications. We have a few databases like MySQL and PostgreSQL that require persistent storage. Setting up StatefulSets seemed straightforward enough in theory, but once you’re managing multiple PersistentVolumeClaims (PVCs) and PersistentVolumes (PVs) across different namespaces, it’s a nightmare.
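For the uninitiated, the basic shape is simple enough. Here’s a minimal sketch of roughly what one of our database StatefulSets looks like; the names, namespace, and storage class are illustrative, not our real config:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres          # illustrative name
  namespace: databases    # illustrative namespace
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:11
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:   # one PVC gets stamped out per replica
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard   # assumed storage class
        resources:
          requests:
            storage: 50Gi
```

Twenty-odd lines per database, and that’s before you’ve said a word about backups, failover, or upgrades.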

We had to deal with issues around PVC provisioning, dynamic volume attachment, and even some bugs in our cluster that caused data corruption during upgrades. It’s one thing to think about these things when they’re on the roadmap; it’s quite another when you’re actually staring at failed deployments due to persistent storage issues.
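A surprising amount of that pain traces back to StorageClass behavior. This is roughly the kind of class we wish we’d started with for databases; the provisioner and parameters here are assumptions for illustration (ours are cloud-specific):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-retain                    # hypothetical name
provisioner: kubernetes.io/gce-pd    # assumed cloud provisioner
parameters:
  type: pd-ssd
reclaimPolicy: Retain                # keep the disk even if the PVC is deleted
volumeBindingMode: WaitForFirstConsumer  # bind only once a pod is scheduled
allowVolumeExpansion: true
```

Retain over Delete alone would have saved us at least one very bad afternoon.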

The Debate: Helm vs. Operator

To make matters worse, we’ve been debating whether to stick with Helm for our application deployments or switch to Operators. Both have their pros and cons. Helm is simpler, but it leaves you doing a lot of manual work around chart dependencies and custom resource definitions (CRDs). Operators encode operational knowledge in a controller that continuously reconciles the cluster toward the desired state, but building and maintaining one adds another layer of complexity.
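To make the contrast concrete: with an Operator, an entire database deployment collapses into a single custom resource and the controller handles the rest. A hypothetical sketch (this CRD and every field in it are invented for illustration, not any real operator’s API):

```yaml
apiVersion: example.com/v1alpha1   # hypothetical API group
kind: PostgresCluster              # hypothetical CRD
metadata:
  name: orders-db
spec:
  version: "11"
  replicas: 3
  storage:
    size: 50Gi
  backups:
    schedule: "0 2 * * *"          # nightly at 02:00
```

Ten lines instead of a StatefulSet, a Service, a StorageClass, and a runbook. The catch is that someone has to write and operate the controller behind it.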

Our internal developer portal, built on Backstage, has made some headway in standardizing our application management practices. We’ve created reusable Helm charts and set up CI/CD pipelines to deploy new applications. But even with these tools, we’re still spending too much time on the plumbing and not enough on actually developing features.
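The deploy step in those pipelines is nothing exotic. Something like this hypothetical GitLab-CI-style job, where the image, chart path, release name, and namespace are all placeholders:

```yaml
# Hypothetical CI job sketch: deploy a service from a shared chart.
deploy:
  stage: deploy
  image:
    name: alpine/helm:3.0.2
    entrypoint: [""]    # override the image's helm entrypoint so `script` runs
  script:
    - >
      helm upgrade --install "$CI_PROJECT_NAME" ./charts/service
      --namespace staging
      --set image.tag="$CI_COMMIT_SHORT_SHA"
      --wait --timeout 5m
```

One job like this per service sounds tidy until you’re maintaining fifty copies of it.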

The Future: SRE and Platform Engineering

As I reflect on all this, it feels like we might be at a crossroads. With the rise of SRE roles and platform engineering as formalized disciplines, maybe it’s time to reassess our approach. Maybe instead of trying to make Kubernetes do everything, we should focus on building better abstractions that sit on top of it.

ArgoCD and Flux are starting to show promise in helping us manage our deployments more efficiently. Maybe it’s time to invest in a more robust GitOps strategy. But as always, the devil is in the details—getting everyone on board with these new tools and processes will be a challenge.
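The model is appealing: point a controller at a Git repo and let it reconcile. For instance, a minimal ArgoCD Application might look like this, with the repo URL, paths, and names as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-service            # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git  # placeholder
    targetRevision: master
    path: apps/orders-service
  destination:
    server: https://kubernetes.default.svc
    namespace: orders
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to what Git says
```

With prune and selfHeal on, the cluster converges back to whatever Git declares, which is exactly the discipline we’re missing today.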

Conclusion

It’s funny how quickly you can go from being an early adopter of Kubernetes to feeling like you’re drowning in its complexity. I think we all need to take a step back, reassess our approach, and maybe even consider some of the newer tools and patterns that have emerged since we first jumped on board.

As for now, I’ll keep pushing through these challenges, one debug session at a time. Maybe next year, things will be a bit smoother.

Stay tuned!