$ cat post/kubernetes-conundrum:-a-deep-dive-into-chaos-engineering.md

Kubernetes Conundrum: A Deep Dive into Chaos Engineering


September 17th, 2018. Kubernetes was solidifying its position as the default container orchestration platform in many teams’ infrastructures. We were at a crossroads: everyone wanted the flexibility and reliability Kubernetes promised, but few knew how to manage it effectively.

At my company, we had been running our microservices on Kubernetes for a while by that point. The setup looked good on paper: stateless services, rolling updates, zero-downtime deploys. But when things started to go south, it became clear that the real-world challenges were much harder than I’d anticipated.

One particularly painful issue cropped up during one of our feature releases. We had introduced a new service with an internal API gateway that was supposed to route traffic from old services to the new ones. In theory, this should have been simple—after all, Kubernetes has built-in support for such scenarios through its Service and Ingress resources.
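
For context, the routing layer looked roughly like the sketch below. The names (legacy-api, new-api, the internal hostname) are placeholders rather than our actual resources, and I’m showing the current networking.k8s.io/v1 Ingress API; back in 2018 this would have been extensions/v1beta1. The idea is simply that a Service fronts each deployment and the Ingress decides which paths go to which backend.

```yaml
# Placeholder names for illustration only.
apiVersion: v1
kind: Service
metadata:
  name: new-api
spec:
  selector:
    app: new-api
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1   # extensions/v1beta1 on 2018-era clusters
kind: Ingress
metadata:
  name: internal-gateway
spec:
  rules:
    - host: api.internal.example.com
      http:
        paths:
          - path: /v2          # the new service takes over one slice of the API
            pathType: Prefix
            backend:
              service:
                name: new-api
                port:
                  number: 80
          - path: /            # everything else still goes to the old service
            pathType: Prefix
            backend:
              service:
                name: legacy-api
                port:
                  number: 80
```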

However, reality hit us hard when we started seeing unexpected behavior in our deployment. The service would crash occasionally during updates, causing cascading failures that took down other services. We had to roll back the update multiple times before settling on a stable version. It was a frustrating experience, but it sparked a lot of discussion around reliability and chaos engineering.

Chaos engineering is the practice of deliberately injecting failures into a system to build confidence that it can withstand real-world conditions. While Kubernetes ships some basic building blocks for this kind of testing (kubectl drain, node taints), we realized we needed something more structured to simulate different failure modes.

We started exploring the Chaos Toolkit, a framework that gives you a structured way to define and run chaos engineering experiments. We crafted an experiment that would randomly kill pods in one of our critical services during an update, so we could observe how the remaining pods handled the load and how quickly replacements were scheduled onto other nodes.
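
A minimal sketch of what such an experiment can look like, assuming the chaostoolkit-kubernetes extension and a hypothetical service labelled app=orders in the default namespace (none of these names are our real ones). The steady-state hypothesis asserts the deployment is healthy, the method terminates one random matching pod, and the hypothesis is re-checked afterwards.

```yaml
version: 1.0.0
title: Critical service tolerates losing a pod mid-rollout
description: Terminate a random pod behind the service and verify the deployment stays healthy.
steady-state-hypothesis:
  title: Deployment is available and healthy
  probes:
    - type: probe
      name: orders-available-and-healthy
      tolerance: true
      provider:
        type: python
        module: chaosk8s.probes
        func: microservice_available_and_healthy
        arguments:
          name: orders
          ns: default
method:
  - type: action
    name: terminate-a-random-orders-pod
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: app=orders
        ns: default
        rand: true
        qty: 1
    pauses:
      after: 30   # give the cluster a moment to react before re-checking the hypothesis
```

Running it is a single `chaos run experiment.yaml` against a cluster you can afford to break, which is what made it practical to fold into pre-release testing later on.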

Running this experiment was eye-opening. It showed us how poorly prepared we were for unexpected failures. The service would indeed fail when a pod went down, but the recovery wasn’t as smooth as we had hoped. We saw issues with liveness probes, misconfigurations in our deployments, and even some subtle bugs that only surfaced during these stress tests.
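
To make the liveness-probe problem concrete (the endpoints and timings below are illustrative assumptions, not our production values): a probe that starts checking before the application has finished booting will restart pods that were never actually broken, and under load that can turn one killed pod into several. These fields sit on the container spec inside the Deployment’s pod template.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15   # let the process finish starting before it can be restarted
  periodSeconds: 10
  failureThreshold: 3       # tolerate transient blips instead of restarting immediately
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 1       # pull the pod out of the Service quickly when it cannot serve
```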

The lessons from this experiment were invaluable. They forced us to revisit our deployment strategy and ensure that our services were resilient by design. We started implementing better health checks, refining our rolling update strategies, and improving our monitoring capabilities. The Chaos Toolkit became an essential part of our pre-release testing process.
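
On the rollout side, the change amounts to something like the sketch below (again with placeholder names and image, not our actual manifest): keep full capacity during an update by surging one extra pod instead of taking one away, and let the readiness probe gate traffic to new pods.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count mid-rollout
      maxSurge: 1         # bring up one extra pod at a time instead
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: registry.example.com/orders:1.4.2   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
```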

This experience made me realize the importance of embracing chaos engineering as a discipline within platform engineering. It’s not just about running Kubernetes; it’s about building systems that can withstand unexpected failures gracefully. As I continue to navigate the complexities of containerized infrastructure, these lessons will undoubtedly shape my approach and inform future projects.


In many ways, this episode was a microcosm of what was happening in tech at the time. The hype around Kubernetes was real, but so were the challenges. Tools like Helm and Istio were emerging to help manage the complexity, while GitOps and platform engineering conversations began to gain traction. But it all came down to one core truth: you can’t just stand up a container orchestration platform and expect everything to magically work.

Kubernetes was winning the container orchestration wars, but we still needed to build our systems with robustness and resilience in mind. And for that, chaos engineering became my new best friend.