$ cat post/the-monolith-ran-/-that-script-still-runs-somewhere-deep-/-uptime-was-the-proof.md

the monolith ran / that script still runs somewhere deep / uptime was the proof


Title: Debugging Kubernetes Clusters: A Day in the Life


June 12, 2017. The day I woke up to yet another Kubernetes cluster that was acting up, and my morning coffee seemed like a mere distraction from what lay ahead.

Last night’s deployment had failed on our staging environment, and our logs weren’t giving us much useful information. We were in the middle of a critical feature release, and this was not good. I grabbed my laptop, ready to dive into the chaos that is Kubernetes debugging.

The Setup

We’ve been using Kubernetes for about a year now, and while it’s become our go-to container orchestration platform, it still manages to throw curveballs. Our staging cluster ran on AWS, with three master nodes and 15 worker nodes. Each service ran three replicas, spread across availability zones.

The Issue

The error messages were vague at best: “container terminated unexpectedly” and “pod failed to start.” What I needed was a clearer picture of the networking and storage between pods, and deeper insight into what was happening inside each container.

Step 1: Shell Into the Pod

I started by opening a shell in one of the failing pods with kubectl exec, which runs a command inside an existing container without changing the application’s state. The command looked something like this:

kubectl exec -it <pod-name> -- /bin/sh

Once inside, I ran journalctl and dmesg to see if there were any system-level errors that might give us a clue.
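In practice the session looked something like this (pod names are placeholders; on minimal container images journalctl may simply not be installed):

```shell
# Open an interactive shell in the failing pod.
kubectl exec -it <pod-name> -- /bin/sh

# Inside the container: the kernel is shared with the node, so dmesg can
# surface OOM kills or disk errors even from within the container.
dmesg | tail -n 50

# journalctl only works if the image ships systemd tooling.
journalctl -xe
```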

Step 2: Pod Events

Next, I checked the pod events using:

kubectl describe pod <pod-name>

The Events section at the bottom of the output gave me more context about why the pod was failing. Often, Kubernetes provides helpful messages there, like “container image not found” or “image pull policy does not allow pull,” which helped narrow down the problem.
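To make the event scan concrete, here is a minimal, self-contained sketch: it saves a sample events dump (the messages are illustrative, not exact Kubernetes strings) and filters for warnings, which is usually where the root-cause hint lives:

```shell
# Save a sample Events section so the filter can be demonstrated without a
# cluster; on a real cluster, pipe `kubectl describe pod <pod-name>` instead.
cat <<'EOF' > /tmp/pod-events.txt
Type     Reason     Age   From               Message
----     ------     ----  ----               -------
Normal   Scheduled  3m    default-scheduler  Successfully assigned staging/api to node-7
Warning  Failed     2m    kubelet            Failed to pull image "registry.example/api:v42"
Warning  BackOff    1m    kubelet            Back-off restarting failed container
EOF

# Warning events are the first place to look for the actual failure.
grep '^Warning' /tmp/pod-events.txt
```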

Step 3: Network Troubleshooting

Given that network issues are common in Kubernetes clusters, I decided to check the network setup. I used kubectl exec to run netstat -tunlp and curl <service-url> from inside a pod to see if services were reachable.
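The checks from inside a pod looked roughly like this (service name, port, and the health-check path are placeholders for whatever your service exposes):

```shell
# Which ports is the process actually listening on inside the pod?
kubectl exec -it <pod-name> -- netstat -tunlp

# Can this pod resolve and reach a dependent service through its cluster DNS name?
kubectl exec -it <pod-name> -- curl -sv http://<service-name>:<port>/<health-path>
```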

Step 4: Resource Limits

I also checked resource limits using:

kubectl describe pods <pod-name>

This revealed whether any container had blown past its memory limit and been OOM-killed. (Exceeding a CPU limit only throttles a container; exceeding a memory limit terminates it unexpectedly.)
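As a concrete sketch, here is the telltale pattern in describe output for an OOM-killed container, inlined as sample text so it can be checked without a live cluster (the values are illustrative):

```shell
# Sample fragment of `kubectl describe pod` output for an OOM-killed
# container; on a real cluster, pipe the describe output into the grep.
cat <<'EOF' > /tmp/pod-describe.txt
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Limits:
      cpu:          500m
      memory:       512Mi
EOF

# Reason "OOMKilled" with exit code 137 means the container exceeded its
# memory limit and was killed by the kernel.
grep -A1 'OOMKilled' /tmp/pod-describe.txt
```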

The Breakthrough

After a few hours of sifting through logs and running commands, I noticed something peculiar. One of the worker nodes was consistently hitting the write IOPS limit on its EBS volume, stalling disk writes and causing the pods scheduled on that node to fail. This wasn’t at all obvious from our monitoring tools at the time.

I updated the node’s EBS settings in AWS and watched as the cluster started functioning normally again. The feature release could proceed without further issues.
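With hindsight, the EBS bottleneck could have been confirmed straight from CloudWatch rather than by elimination. A sketch with the AWS CLI, assuming you know the node’s volume ID (the ID and time window are placeholders):

```shell
# Sum of write operations per 5-minute window for the node's EBS volume;
# a line pinned at the volume's IOPS ceiling is the smoking gun.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeWriteOps \
  --dimensions Name=VolumeId,Value=<volume-id> \
  --start-time 2017-06-12T00:00:00Z \
  --end-time 2017-06-12T12:00:00Z \
  --period 300 \
  --statistics Sum
```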

Reflections

This incident highlighted some of the challenges with running Kubernetes clusters, especially in a hybrid cloud environment like ours. While we had solid logging and monitoring set up, they often didn’t provide enough context to diagnose the root cause quickly.

I realized that while Kubernetes makes deployment easier, it also introduces complexity that traditional tools aren’t always prepared for. We needed better visibility into the network and storage layers of our clusters.

Moving Forward

In the weeks following this incident, we started exploring tools like Prometheus and Grafana for more detailed monitoring. We also looked into integrating Kibana with Kubernetes to get a broader view of cluster health. These steps were part of a larger effort to modernize our infrastructure operations.

Debugging Kubernetes is never straightforward, but it’s a necessary evil in today’s fast-paced development cycles. Every issue we solve brings us closer to mastering this powerful tool.


This was just another day in the life of a platform engineer at the time. Debugging Kubernetes isn’t glamorous, but it’s a critical part of keeping our services up and running.