$ cat post/the-branch-was-deleted-/-the-interrupt-handler-failed-/-i-wrote-the-postmortem.md

10DEC18

the branch was deleted / the interrupt handler failed / I wrote the postmortem

Title: Kubernetes Gotchas: A Year After I Thought I Knew It All

December 10, 2018. The air is crisp, and the holiday lights are starting to twinkle in my neighborhood. But inside my mind, there’s a quiet hum of frustration as I reflect on how far I’ve come with Kubernetes and how much further there is to go.

It’s been exactly one year since I thought I had all the answers about Kubernetes. Back then, I was confidently deploying complex applications using Helm charts and Istio for service mesh, and I felt like a pro. But now, as I look back, I realize that my understanding of Kubernetes wasn’t nearly as solid as I’d imagined.

The Big Rollout

Last year, we decided to fully commit to Kubernetes at our company. We had a few microservices running in Docker containers, but this was going to be the big push. We planned a phased rollout, starting with a handful of pods and gradually increasing the deployment size over several weeks.

We were confident. We had tested everything in staging environments and thought we’d ironed out all the kinks. But as soon as we hit production, chaos ensued. Pods kept crashing, and services started to flake out one by one. I spent countless nights wrestling with Kubernetes logs, trying to figure out what was going wrong.

Resource Management

One of the first issues I encountered was resource management. We had pods running on nodes that did not have enough CPU or memory for them. Kubernetes wasn’t moving those pods to healthier nodes; it just kept restarting them in place, and they kept failing. After digging into the kube-scheduler logs, I realized we needed a more nuanced approach to resource allocation.
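To make the scheduling side of this concrete: the scheduler places a pod according to its declared resource requests, not its live usage, so missing or understated requests can put a pod on a node that cannot sustain it. Below is a minimal sketch of that fit check. It is a simplification for illustration, not the kube-scheduler's actual logic; the node and pod figures and the helper names are hypothetical, and only a few quantity suffixes are handled.

```python
# Simplified illustration of request-based placement.
# Not the kube-scheduler's real code; all names and numbers are hypothetical.

MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> int:
    """Return CPU in millicores: '500m' -> 500, '2' -> 2000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Return memory in bytes for plain byte counts and Ki/Mi/Gi suffixes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def fits(node_allocatable: dict, scheduled_requests: list, pod_request: dict) -> bool:
    """True if the pod's requests fit in what existing requests leave free."""
    used_cpu = sum(parse_cpu(r["cpu"]) for r in scheduled_requests)
    used_mem = sum(parse_memory(r["memory"]) for r in scheduled_requests)
    free_cpu = parse_cpu(node_allocatable["cpu"]) - used_cpu
    free_mem = parse_memory(node_allocatable["memory"]) - used_mem
    return (
        parse_cpu(pod_request["cpu"]) <= free_cpu
        and parse_memory(pod_request["memory"]) <= free_mem
    )


node = {"cpu": "2", "memory": "4Gi"}
already_scheduled = [
    {"cpu": "500m", "memory": "1Gi"},
    {"cpu": "1", "memory": "2Gi"},
]

# 500m CPU and 1Gi memory remain on the node after the existing requests.
print(fits(node, already_scheduled, {"cpu": "250m", "memory": "512Mi"}))
print(fits(node, already_scheduled, {"cpu": "750m", "memory": "512Mi"}))
```

The check only looks at declared requests, which is why a pod that requests too little can be placed on a node and then starve once it is running.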

I introduced custom metrics to the cluster using Prometheus, which allowed us to better understand how resources were being used across our pods. This helped us identify underpowered nodes and adjust deployments accordingly. We also started using kubectl top to check live resource usage and kubetail to follow logs across pods in real time, giving us more visibility into what was happening.
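As a rough illustration of the kind of check this made possible, the sketch below takes two instant-query results in the JSON shape returned by the Prometheus HTTP API (`/api/v1/query`) and lists pods whose memory usage is close to their request. The sample payloads, the `pod` label, and the 0.9 threshold are assumptions for illustration; real metric and label names depend on the exporters running in a given cluster.

```python
def by_pod(response: dict) -> dict:
    """Map pod name -> sample value from an instant-query vector result."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    values = {}
    for series in response["data"]["result"]:
        pod = series["metric"].get("pod")
        if pod is not None:
            # Each sample is [unix_timestamp, "<number as a string>"].
            values[pod] = float(series["value"][1])
    return values


def near_request(usage: dict, requests: dict, threshold: float = 0.9) -> list:
    """Return pods whose usage is at or above threshold * request."""
    flagged = []
    for pod, used in sorted(usage.items()):
        requested = requests.get(pod)
        if requested and used / requested >= threshold:
            flagged.append(pod)
    return flagged


# Hypothetical sample payloads standing in for real query results.
usage_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "api-1"}, "value": [1544400000, "503316480"]},
            {"metric": {"pod": "worker-1"}, "value": [1544400000, "104857600"]},
        ],
    },
}
request_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "api-1"}, "value": [1544400000, "536870912"]},
            {"metric": {"pod": "worker-1"}, "value": [1544400000, "536870912"]},
        ],
    },
}

flagged = near_request(by_pod(usage_response), by_pod(request_response))
print(flagged)
```

In a real setup the two payloads would come from HTTP calls to the Prometheus server rather than from literals.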

Networking Headaches

Networking was another nightmare. Our application had a complex service mesh with Istio, which seemed like the perfect solution for handling traffic routing and security. But as soon as we deployed it, things got messy. Sidecar proxies started to misbehave, causing cascading failures throughout our services.

I spent weeks debugging these issues, only to realize that I had been using istioctl commands incorrectly. Once I read the documentation more carefully and started following best practices, the problems began to clear up. We ended up writing a small tool to automate some of the common tasks with Istio, which greatly simplified our workflow.

Secrets Management

Secrets management was another area where we faced significant challenges. Initially, we just stored sensitive information in Kubernetes Secrets, but as the number of secrets grew, it became increasingly difficult to manage them. We found ourselves manually editing YAML files and hoping nothing would go wrong.
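Part of what makes hand editing fragile is that values under a Secret's `data` field must be base64-encoded, so a stray trailing newline picked up before encoding ends up inside the credential. Here is a minimal sketch of generating the manifest instead of typing it. The secret name, namespace, and keys are hypothetical, and the output is JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import base64
import json


def secret_manifest(name: str, namespace: str, values: dict) -> dict:
    """Build a v1 Secret manifest with base64-encoded values."""
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": encoded,
    }


# Hypothetical example values; never commit real credentials.
manifest = secret_manifest(
    "db-credentials",
    "default",
    {"username": "app", "password": "example-only"},
)
print(json.dumps(manifest, indent=2))
```

Base64 is an encoding, not encryption, so a generated manifest like this still needs the same care as the plaintext values.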

After some research, I recommended using HashiCorp Vault for centralized secret management. This allowed us to securely store and retrieve secrets without hardcoding them into our applications. Integrating Vault with Kubernetes was a bit of a headache, but it paid off in the long run by providing better security and flexibility.
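For readers unfamiliar with the flow, here is a minimal sketch of how a pod might fetch a secret through Vault's Kubernetes auth method and the KV version 2 HTTP API. It is an illustration rather than our exact setup: the Vault address, role name, and secret path are hypothetical, the auth method is assumed to be mounted at `auth/kubernetes` and the KV engine at `secret/`, and error handling is omitted.

```python
import json
import urllib.request

VAULT_ADDR = "https://vault.example.internal:8200"  # hypothetical address
# Standard location of the pod's service account token.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def login_request(role: str, service_account_jwt: str) -> urllib.request.Request:
    """Build the login call for the Kubernetes auth method."""
    body = json.dumps({"role": role, "jwt": service_account_jwt}).encode("utf-8")
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/auth/kubernetes/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_request(client_token: str, path: str) -> urllib.request.Request:
    """Build a KV v2 read for the given path under the secret/ mount."""
    return urllib.request.Request(
        f"{VAULT_ADDR}/v1/secret/data/{path}",
        headers={"X-Vault-Token": client_token},
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_secret(role: str, path: str) -> dict:
    """Log in with the pod's service account token, then read one secret."""
    with open(SA_TOKEN_PATH) as token_file:
        jwt = token_file.read().strip()
    token = send(login_request(role, jwt))["auth"]["client_token"]
    # KV v2 nests the key/value pairs under data.data.
    return send(read_request(token, path))["data"]["data"]
```

Inside a pod, a call such as `fetch_secret("my-app", "my-app/db")` (both names hypothetical) would return the secret's key/value pairs, so the application never needs the values baked into its image or manifests.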

Lessons Learned

Looking back on this year, I realize that Kubernetes is far from a one-size-fits-all solution. It requires careful planning, ongoing monitoring, and continuous optimization. Tools like Helm, Istio, and Vault are powerful, but they need to be used correctly.

One of the biggest takeaways for me was the importance of setting up proper observability from the start. Tools like Prometheus and Grafana were essential in helping us understand what was happening within our cluster. Without them, we would have been completely in the dark.

Moving Forward

As I prepare for another year with Kubernetes, I’m excited about where this journey will take us. The platform engineering conversations that started last year continue to evolve, and new tools like Knative are emerging to simplify serverless deployments. But for now, my focus is on refining our Kubernetes setup, ensuring we have the right tools in place to handle any challenges that come our way.

This isn’t just about technical solutions; it’s about building a robust platform that can support our growing infrastructure and provide reliable services to our users. And while I may not know everything yet, I’m confident that with each challenge, my understanding of Kubernetes, and of the tech world around it, will only grow stronger.

Happy holidays, everyone! Here’s to another year of learning and growth in technology.
