$ cat post/debugging-kubernetes-in-production:-a-love/hate-story.md
Debugging Kubernetes in Production: A Love/Hate Story
January 6, 2020. I woke up to the sound of my alarm and a stack trace that just wouldn’t go away. It was one of those mornings where you know it’s going to be a long day.
Last night, during our monthly on-call rotation for the platform team, we received an alert: “kubelet failed to pull pod image.” Now, Kubernetes has its share of quirks and edge cases, but usually these kinds of issues are pretty straightforward. This time was different.
The error pointed to a misconfiguration on one of our cluster nodes. But wait: this node had been working fine for weeks! How could it suddenly stop pulling images? It felt like a Kubernetes versioning issue, or maybe a bug in the latest update that had slipped through.
As I dug deeper into the logs and tried to understand what was happening, my frustration grew. Kubernetes is one of those systems that you love for its flexibility and capabilities but hate when it starts giving cryptic errors. It’s like trying to debug a complex program with no clear entry point.
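For what it’s worth, “digging into the logs” for an image-pull failure almost always means the same short checklist. Roughly this, with the pod and node names as placeholders:

```bash
# Which pods are failing to pull, and what does Kubernetes say about them?
kubectl get pods --all-namespaces | grep -E 'ErrImagePull|ImagePullBackOff'
kubectl describe pod <pod-name> -n <namespace>   # the Events section carries the actual pull error

# Is the node itself healthy, and what versions is it running?
kubectl get nodes -o wide                        # kubelet version, OS, readiness at a glance
kubectl describe node <node-name>                # conditions, taints, recent events

# On the node, the kubelet logs usually have the real reason
journalctl -u kubelet --since "1 hour ago" | grep -i pull
```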
I spent hours looking at different nodes, checking their resource usage, and cross-referencing them against our monitoring tools. Eventually, I stumbled upon an issue in the kubelet configuration that was causing it to fail silently when pulling images from a specific registry. Once I fixed that misconfiguration, everything started working again.
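I won’t reproduce our exact kubelet config here, but if you hit the same symptom, the shape of the fix is usually the same: prove the node can actually authenticate to the registry, then make the credentials explicit instead of relying on whatever happens to be sitting on the node. A sketch only; the registry, namespace, and secret names below are made up:

```bash
# Run on the affected node: can it pull from the registry at all?
docker pull registry.internal.example.com/platform/portal:1.4.2   # or crictl pull, depending on the runtime

# If it's a credentials problem, put the pull secret in the cluster where it's visible
kubectl create secret docker-registry internal-registry-creds \
  --docker-server=registry.internal.example.com \
  --docker-username=ci-pull \
  --docker-password="$REGISTRY_TOKEN" \
  -n platform

# ...and attach it to the namespace's default ServiceAccount so every pod picks it up
kubectl patch serviceaccount default -n platform \
  -p '{"imagePullSecrets": [{"name": "internal-registry-creds"}]}'
```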
But this wasn’t the end of my Kubernetes woes for the day. A few hours later, another alert came through: “Pod is stuck in ContainerCreating state.” This time, the pod in question was part of our internal developer portal infrastructure built with Backstage. The logs showed that it was waiting to mount a volume from an external source.
Mounting volumes can be tricky even on a good day, but Kubernetes makes it seem like a full-time job. I had to figure out if it was an issue with the PersistentVolumeClaim, the storage class configuration, or something else entirely. It turned out that there was a permissions issue on the volume mount point that wasn’t being surfaced properly in the logs.
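If you land on the same symptom, kubectl describe pod will usually name the layer that’s stuck: FailedAttachVolume points at the volume, FailedMount at the mount, and a Bound PVC plus a mount error points at the filesystem itself. The manifest-level version of a permissions fix like ours is usually fsGroup, which asks Kubernetes to make the mounted volume group-writable by the pod. This is a sketch, not our real manifest; the names and IDs are illustrative:

```yaml
# Illustrative only: make the mounted volume writable by the app's group via fsGroup
apiVersion: v1
kind: Pod
metadata:
  name: backstage-portal
  namespace: developer-portal
spec:
  securityContext:
    fsGroup: 1000                 # supporting volume plugins chown the mount to this GID
  containers:
    - name: backstage
      image: registry.internal.example.com/platform/backstage:1.0.0
      volumeMounts:
        - name: portal-data
          mountPath: /var/backstage
  volumes:
    - name: portal-data
      persistentVolumeClaim:
        claimName: portal-data
```

One caveat: fsGroup only applies on volume plugins that support ownership management, so it’s worth checking what your storage class’s driver does before leaning on it.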
After fixing those permissions, I realized how much of Kubernetes relies on proper setup and consistent configurations across nodes. It’s like trying to build a complex puzzle where each piece needs to fit just right. A single misconfiguration can cause all sorts of issues downstream.
As I leaned back at my desk, sipping coffee, I couldn’t help but think about the state of platform engineering in 2020. SRE roles were becoming more prevalent as companies realized they needed a dedicated team to handle these kinds of infrastructure issues. Backstage was gaining traction as an internal developer portal, and we were starting to see more teams adopt GitOps practices with ArgoCD and Flux.
But despite all the tools and methodologies available, Kubernetes still felt like the Wild West. It’s not just about writing code anymore; it’s about managing a complex ecosystem of services that can bite you at any moment. And even as these systems mature, there are always new challenges to face.
In the meantime, I went back to improving our monitoring and alerting so we could catch issues like these faster in the future. Kubernetes may be complicated, but it’s also incredibly powerful. With the right tools and a bit of perseverance, you can make it work for your team.
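The first concrete piece of that work was a pair of alerts for today’s two failure modes, so the next person on call gets paged before a user notices. Something like this, assuming kube-state-metrics is already being scraped by Prometheus; the thresholds are placeholders to tune:

```yaml
# Prometheus alerting rules; assumes kube-state-metrics is deployed and scraped
groups:
  - name: platform-pod-health
    rules:
      - alert: PodImagePullFailing
        expr: |
          sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason=~"ErrImagePull|ImagePullBackOff"}) > 0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} cannot pull its image"
      - alert: PodStuckContainerCreating
        expr: |
          sum by (namespace, pod) (kube_pod_container_status_waiting_reason{reason="ContainerCreating"}) > 0
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} has been stuck in ContainerCreating"
```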
That’s what I’ve been dealing with today. Debugging Kubernetes in production is both frustrating and rewarding. It’s one of those days where you feel like you’re fighting an uphill battle, but you keep pushing through because the end result—deploying a reliable service that works seamlessly—is worth it.