$ cat post/when-debugging-meets-real-life:-a-kubernetes-cluster-woes-story.md
When Debugging Meets Real Life: A Kubernetes Cluster Woes Story
On October 18, 2021, I found myself knee-deep in a Kubernetes cluster that was acting more like quicksand than the solid ground my application needed. The problem? Unreliable service discovery and persistent pod crashes. It felt like every other day was spent debugging some new wrinkle in our deployment pipeline.
The Setup
We had been using ArgoCD for GitOps, which is a fantastic tool that keeps our clusters in sync with our configuration files stored in Git. However, we were running into issues where pods would sporadically crash and fail to restart properly. This was particularly frustrating because our application was mission-critical, and downtime was not an option.
The Symptoms
The symptoms were clear enough: random pod crashes, inconsistent logs, and a lack of meaningful error messages. It made me feel like I was trying to debug the Matrix. We had set up proper logging with fluent-bit and Elasticsearch, but it seemed like nothing we did could catch the exact moment when something went wrong.
The Investigation
I spent hours digging through logs, pod events, and Kubernetes configuration files. The frustration mounted as I realized that ArgoCD wasn't the cause of these issues; at most, it was surfacing an existing problem in our cluster's infrastructure.
One of the first things I checked was the service mesh we were using, Istio. We had implemented it to manage traffic routing and enable advanced networking features. It seemed plausible that a misconfiguration somewhere in the mesh was preventing pods from connecting properly.
Step 1: Service Mesh Audit
I decided to audit our Istio configuration. After digging through the config files, I noticed that some of the service annotations were not correctly applied to all services. This inconsistency might have been causing intermittent connectivity issues.
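As a concrete illustration of the kind of inconsistency involved: Istio's automatic sidecar injection is driven by the `istio-injection` namespace label, and individual workloads can opt out with the `sidecar.istio.io/inject` annotation. If these are applied unevenly, some pods end up outside the mesh while their peers are inside it. A minimal sketch (the resource names are illustrative, not our actual manifests):

```yaml
# Namespace-level: enables automatic sidecar injection for every pod created in it.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                   # illustrative name
  labels:
    istio-injection: enabled
---
# Pod-level override: this annotation opts a single workload out of injection.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-worker            # illustrative name
  annotations:
    sidecar.istio.io/inject: "false"
spec:
  containers:
    - name: worker
      image: legacy-worker-image # illustrative image
```

A pod opted out like this cannot participate in mutual TLS with meshed services, which is exactly the sort of thing that shows up as intermittent connection failures.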
Step 2: Network Policies Review
Next, I looked at our network policies to ensure they weren’t blocking traffic unnecessarily. It’s easy to overcomplicate things with network policies, and often a simpler configuration works better. After simplifying some of the rules, I saw an improvement in stability but not a complete resolution.
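A sketch of the kind of simplification this step involved: instead of a pile of narrow, overlapping rules, a single policy that allows all pods within a namespace to talk to each other (the name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace     # illustrative name
spec:
  podSelector: {}                # selects every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}        # allow traffic from any pod in the same namespace
```

Starting from a broad baseline like this and tightening it deliberately is usually easier to reason about than debugging many fine-grained rules at once.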
Step 3: Pod Start-Up Issues
I moved on to investigate why pods were failing at start-up. This is where Kubernetes' infamous CrashLoopBackOff pattern comes into play: you see a container get created, then immediately die without any meaningful log output. I decided to enable debug logging for the containers in question via an application-level environment variable.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: my-container
      image: my-image
      env:
        # Application-specific flag read by our own code;
        # Kubernetes itself attaches no meaning to this variable.
        - name: KUBERNETES_DEBUG
          value: "true"
```
Enabling debug logs gave me more insight into the actual start-up process, but it still didn’t reveal the root cause.
The Breakthrough
After a few days of relentless debugging, I finally had an epiphany. It wasn’t just our Istio or network policies; it was something more fundamental—our storage class and PersistentVolumeClaims (PVCs). Our application stored large amounts of data on disk, and the PVCs were not being correctly created.
I realized that we needed to revisit how we manage volumes in Kubernetes. Some of our workloads were still relying on emptyDir volumes, which are ephemeral: they live on the node's local disk (or in memory, if configured with `medium: Memory`) and are deleted whenever the pod is removed. That can work for scratch space, but it is a problem for stateful services like ours.
Solution: Persistent Storage
To address this issue, I updated our PVCs to use a persistent storage class that provides actual disk storage. This change was straightforward once identified, and after redeploying the affected pods, stability improved significantly.
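The shape of the fix looks roughly like this: a PersistentVolumeClaim that requests disk-backed storage by name (the claim name, class name, and size here are illustrative, not our production values):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data              # illustrative name
spec:
  accessModes:
    - ReadWriteOnce              # mountable read-write by a single node
  storageClassName: standard     # a disk-backed StorageClass offered by the cluster
  resources:
    requests:
      storage: 10Gi
```

The pod then mounts this claim with a `persistentVolumeClaim` volume instead of an `emptyDir`, so the data survives pod restarts and rescheduling.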
Reflection
Debugging Kubernetes clusters can be a frustrating experience, but it’s also an opportunity for growth. In this case, we learned about the importance of consistent service annotations and proper volume management in stateful applications.
This episode highlighted how complex distributed systems can be and reinforced my belief that simplicity is often key to robust infrastructure. It’s easy to get bogged down with complex configurations and tools, but sometimes a step back to basics can reveal the real issues.
In the end, it was a reminder of why I love working on platforms—there’s always something new to learn and challenges to solve. And who knows, maybe next time I’ll have an easier fix!