$ cat post/the-monolith-ran-/-the-incident-taught-us-the-most-/-the-stack-still-traces.md

the monolith ran / the incident taught us the most / the stack still traces


Title: The Kubernetes Conundrum


April 18, 2016. A day I remember all too well. It was the era of Kubernetes winning the container wars, but that victory didn’t come without its share of frustrations and learning curves.

It started like any other morning, with my team and me looking over our cluster metrics in Prometheus. We noticed a spike in CPU usage on our production Kubernetes nodes, and the graphs showed pods being killed and restarted constantly, which was unexpected given the steady state of our application. After some digging, we discovered it wasn’t an issue with our code at all; it was the container runtime.
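
For the curious, the graphs boiled down to roughly the two queries below. This is a minimal sketch against Prometheus’s HTTP API; the URL and the metric names (current node_exporter and kube-state-metrics names) are assumptions for illustration, not a record of our 2016 setup.

```python
# Minimal sketch of the queries behind those graphs: node CPU usage and
# container restart counts, pulled from Prometheus's HTTP API.
# PROM_URL and the metric names are assumptions, not our 2016 setup.
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # hypothetical address

def prom_query(expr):
    """Run an instant query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Per-node CPU usage, to confirm the spike is real and node-wide.
busy = 'sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))'
for series in prom_query(busy):
    print(f'{series["metric"]["instance"]}: {float(series["value"][1]):.2f} cores busy')

# Container restarts over the last hour, to see which pods are being churned.
restarts = 'increase(kube_pod_container_status_restarts_total[1h]) > 0'
for series in prom_query(restarts):
    labels = series["metric"]
    print(f'{labels["namespace"]}/{labels["pod"]}: {float(series["value"][1]):.0f} restarts')
```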

We were using Docker, and it turned out that the compatibility between Docker’s cgroup handling and the kubelet’s Docker integration had a bit of a kink. The problem manifested as pods being constantly evicted for resource exhaustion, even though the nodes had plenty of headroom. We spent hours trying different combinations of Docker versions and configuration tweaks, but nothing seemed to stick.
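
To make the symptom concrete: pods the kubelet evicts are left behind with a reason of Evicted, and the node conditions tell you whether the kubelet genuinely believes it is under memory or disk pressure. Here’s a rough, purely illustrative sketch of that triage using the official Python client (not tooling we had back then):

```python
# Rough sketch of the triage: list pods the kubelet has evicted and check
# node pressure conditions. Uses the official client (pip install kubernetes);
# purely illustrative, not the tooling we had in 2016.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# Evicted pods are left behind with phase=Failed and reason=Evicted.
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: {pod.status.message}")

# Node conditions show whether the kubelet really thinks it is under pressure.
for node in v1.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type in ("MemoryPressure", "DiskPressure") and cond.status == "True":
            print(f"{node.metadata.name}: {cond.type}=True ({cond.reason})")
```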

Finally, after a few days of head-scratching, we decided to take the leap and switch to containerd. It was a bold move at the time, when most teams were still on Docker or experimenting with rkt, but we believed it would resolve our issues. The change required significant refactoring of our deployment scripts and some manual adjustments to our Kubernetes manifests.
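
To give a flavour of those manual adjustments: one recurring task in a Docker-to-containerd move is hunting down manifests that quietly assume Docker, such as hostPath mounts of the Docker socket. A throwaway script along these lines captures the idea; the directory argument and the tell-tale strings are hypothetical, not our actual migration tooling.

```python
# Illustrative manifest audit: walk Kubernetes YAML files and flag anything
# that quietly assumes Docker, e.g. hostPath mounts of the Docker socket.
# The directory argument and the tell-tale strings are hypothetical.
import sys
from pathlib import Path

import yaml  # pip install pyyaml

DOCKER_HINTS = ("/var/run/docker.sock", "docker.sock")

def docker_references(obj, path=""):
    """Recursively yield (path, value) pairs that mention the Docker socket."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from docker_references(val, f"{path}.{key}")
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from docker_references(val, f"{path}[{i}]")
    elif isinstance(obj, str) and any(hint in obj for hint in DOCKER_HINTS):
        yield path, obj

for manifest in Path(sys.argv[1]).rglob("*.yaml"):
    for doc in yaml.safe_load_all(manifest.read_text()):
        if not doc:
            continue
        for where, value in docker_references(doc):
            print(f"{manifest}: {where} -> {value}")
```

Run against a manifests directory (for example, `python audit.py manifests/`), it turns “we think we caught everything” into a list you can actually review.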

Transitioning from Docker to containerd wasn’t smooth. There were differences in how volumes were mounted and how pod networking was wired up, which forced changes across our tooling. We ended up writing a custom CNI plugin for pod networking because the off-the-shelf ones didn’t cover all of our needs. It was a lot of work, but it paid off when those CPU spikes disappeared.
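
For anyone who has never written one, a CNI plugin is just an executable: the runtime invokes it with the operation in the CNI_COMMAND environment variable and the network configuration as JSON on stdin, and the plugin replies with JSON on stdout. The skeleton below sketches only that contract; the hard-coded address is a placeholder and all the real veth and network-namespace plumbing is left out, so treat it as an outline of the interface rather than a reconstruction of our plugin.

```python
#!/usr/bin/env python3
# Skeleton of the CNI contract a pod network plugin has to satisfy. The runtime
# execs the plugin with CNI_COMMAND (ADD / DEL / VERSION) in the environment
# and the network config as JSON on stdin; the plugin replies with JSON on
# stdout. The address below is a placeholder and all real veth / network
# namespace plumbing is omitted.
import json
import os
import sys

def main():
    command = os.environ.get("CNI_COMMAND", "")

    if command == "VERSION":
        json.dump({"cniVersion": "0.3.1",
                   "supportedVersions": ["0.3.0", "0.3.1"]}, sys.stdout)
        return 0

    conf = json.load(sys.stdin)                  # config pushed by the runtime
    netns = os.environ.get("CNI_NETNS", "")      # path to the pod's netns
    ifname = os.environ.get("CNI_IFNAME", "eth0")

    if command == "ADD":
        # A real plugin would create a veth pair, move one end into `netns`,
        # ask IPAM for an address, and set up routes. All elided here.
        json.dump({"cniVersion": conf.get("cniVersion", "0.3.1"),
                   "interfaces": [{"name": ifname, "sandbox": netns}],
                   "ips": [{"version": "4",
                            "address": "10.244.0.5/24",  # placeholder address
                            "interface": 0}]}, sys.stdout)
        return 0

    if command == "DEL":
        # A real plugin would tear down the interface and release the IP.
        return 0

    json.dump({"cniVersion": "0.3.1", "code": 4,
               "msg": f"unsupported CNI_COMMAND {command!r}"}, sys.stdout)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```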

This episode taught us the value of standardizing on a widely adopted container runtime. Docker and containerd share much of the same underlying plumbing (by then, Docker itself ran on top of containerd), which made the switch far less risky for us than a move to rkt or another runtime with a smaller footprint in the ecosystem.

As Kubernetes continued to evolve, so did our infrastructure practices. We started exploring Helm to manage our deployments more efficiently. It tamed the complexity of maintaining multiple environments and cut down on manual configuration errors. But every new tool comes with a learning curve, and we spent quite some time wrestling with the nuances of Helm’s templating syntax and its dependency resolution.
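
One habit that took the edge off the templating headaches was rendering every chart before anything touched a cluster. Below is a rough sketch of that kind of pre-flight check; the `charts/` layout is hypothetical, and it assumes a `helm` binary that provides `helm lint` and `helm template`.

```python
# Rough sketch of a Helm pre-flight check: lint each chart and render its
# templates offline, so template errors surface before any deploy.
# The charts/ layout is hypothetical; assumes `helm lint` and `helm template`.
import subprocess
import sys
from pathlib import Path

CHARTS_DIR = Path("charts")  # hypothetical repo layout
failures = 0

for chart in sorted(p for p in CHARTS_DIR.iterdir() if p.is_dir()):
    for step in (["helm", "lint", str(chart)],
                 ["helm", "template", str(chart)]):
        result = subprocess.run(step, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
            print(f"FAILED: {' '.join(step)}\n{result.stderr}", file=sys.stderr)

sys.exit(1 if failures else 0)
```

Wired into CI, a check like this makes template typos fail the build instead of a deploy.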

Somewhat later, Istio began to gain traction as well. The concept of a service mesh seemed promising, offering traffic management and observability capabilities that were still somewhat lacking in Kubernetes itself. We started small, injecting the sidecar proxy into a single service just to see what it could do. It was eye-opening; the visibility into our services’ health and performance metrics was unparalleled.
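
Rolling a mesh out one service at a time also means keeping track of which pods actually carry the sidecar. Here’s a small illustrative sketch of that check using the Python Kubernetes client; the `istio-proxy` container name is the injector’s conventional default and is an assumption here.

```python
# Small illustrative check for a gradual rollout: report which running pods
# carry the Envoy sidecar. The container name "istio-proxy" is the injector's
# conventional default and is assumed here.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

meshed, unmeshed = [], []
for pod in v1.list_pod_for_all_namespaces().items:
    names = [c.name for c in (pod.spec.containers or [])]
    bucket = meshed if "istio-proxy" in names else unmeshed
    bucket.append(f"{pod.metadata.namespace}/{pod.metadata.name}")

print(f"{len(meshed)} pods in the mesh, {len(unmeshed)} outside it")
for name in unmeshed:
    print(f"  not yet injected: {name}")
```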

But Istio wasn’t without its downsides. Yet another layer of infrastructure meant every team member had to understand how to use it properly. We spent countless meetings arguing about where Istio belonged: whether awareness of it should live in the application code, or whether it should stay entirely within the network layer.

By mid-2016, Terraform was still pre-1.0, and the practices that would later be called GitOps were only just taking shape. The idea of managing infrastructure as code resonated with us, but we were hesitant to fully commit to a tool whose version numbering suggested ongoing churn. We took a pragmatic approach, using Terraform for some environments while continuing to configure others by hand.

Looking back, this period really was the early days of Kubernetes adoption: full of excitement and challenges. The ecosystem was still maturing, with new tools and practices emerging almost daily, and we often felt like we were walking through uncharted territory, figuring things out as we went.

But that’s what makes this journey so rewarding. We’re not just building systems; we’re actively shaping the future of infrastructure. Every issue we solve contributes to making our platform more robust and reliable. And every tool we adopt comes with its own set of trade-offs and learning opportunities.

So here’s to the Kubernetes conundrums, the Helm headaches, and the endless nights spent debugging container runtimes. We may have stumbled a few times, but each misstep brought us closer to mastering this incredible technology.

Stay tuned for more adventures in platform engineering!