
cron job I forgot / the database was the truth / the pipeline knows


Title: Debugging Kubernetes in a Remote World


May 27, 2019 was just another day in the life of an engineer, but for me it was a reminder that even with all the tools at our disposal, debugging is still an art. That day started like any other: I woke up, logged into my laptop (which I now have to share with my cat), and headed straight into a remote support session.

The Setup

My team uses Kubernetes as our main orchestration tool for deploying applications. That day, a customer was having trouble scaling their services: they were seeing inconsistent behavior across different pods, and it wasn't clear whether the issue lay in their application or in the Kubernetes cluster itself.

We jumped on a call with the ops team on the customer's side. The first thing I noticed was that the logs weren't as clean as they should have been; there was noise we couldn't pin down the source of. I asked whether they had looked through the Kubernetes events, and they had, but they hadn't seen anything unusual.
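
For anyone retracing that first pass, the events check looked roughly like the commands below; the namespace is a placeholder rather than the customer's real setup.

```bash
# Recent events for the affected namespace, newest last.
# "payments" is a stand-in namespace for illustration.
kubectl get events -n payments --sort-by=.lastTimestamp

# Warnings only, which is where failing probes, OOM kills, and
# scheduling problems usually surface.
kubectl get events -n payments --field-selector type=Warning
```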

Digging Deeper

I pulled up the cluster in my browser using Backstage, our internal developer portal. It’s a great tool for visualizing your infrastructure, but today it felt like it wasn’t enough. We needed more granular data to understand what was happening inside those pods.

After some discussion, we decided to go back to plain kubectl, using kubectl top to get more insight. The output was overwhelming at first: so many containers running, each with its own set of metrics. It took us a while to filter through the noise and identify the problematic pods.
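
The filtering boiled down to a handful of kubectl invocations along these lines. The namespace and pod name are made up for illustration, and kubectl top assumes metrics-server is running in the cluster.

```bash
# Per-pod CPU and memory across the namespace (needs metrics-server).
kubectl top pods -n payments

# Break a suspicious pod down by container to see which one is hot.
# The pod name is invented for the example.
kubectl top pod checkout-7c9f5d8b4-x2lqp -n payments --containers

# Cross-check the pod's recent state: restarts, probe failures,
# resource limits it might be bumping into.
kubectl describe pod checkout-7c9f5d8b4-x2lqp -n payments
```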

Then I remembered that eBPF could be useful here. We used bpftrace to get real-time insight into what was going on inside the pods, writing a few simple probes to trace system calls. They quickly revealed an issue with how the application was interacting with the file system.
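
I won't reproduce the exact probes, but a minimal sketch of the idea is a bpftrace one-liner that counts read and write syscalls per process, run on the node where the suspect pod was scheduled:

```bash
# Count read and write syscalls per process on the node hosting the
# suspect pod. Run as root on the node itself (or from a privileged
# debug container). Ctrl-C stops the trace and prints the counts.
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read,
tracepoint:syscalls:sys_enter_write
{
    @calls[comm, probe] = count();
}'
```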

The Eureka Moment

The moment of clarity came when we realized that one of the pods was running a version of the code that generated far more disk reads and writes than it should. None of this showed up in the logs, but it was enough to drag down the entire cluster.

We rolled out an updated version to replace the problematic pod, which resolved the issue almost immediately. The customer's team was relieved; they had been pulling their hair out trying to figure this one out.
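
How the fix actually ships depends on the customer's deployment tooling, but a minimal kubectl version of that rollout looks something like this, with made-up names throughout:

```bash
# Point the deployment at the fixed image. Deployment, container,
# registry, and tag are all placeholders.
kubectl set image deployment/checkout \
    checkout=registry.example.com/checkout:1.4.2 -n payments

# Watch the old pods get replaced and confirm the rollout finishes.
kubectl rollout status deployment/checkout -n payments
```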

Lessons Learned

This experience reinforced my belief in the importance of having a diverse set of tools and techniques when debugging Kubernetes clusters. It’s not enough just to rely on standard logging or even fancy developer portals like Backstage. Sometimes, you need to get down and dirty with eBPF traces or other low-level diagnostics.

It also highlighted how important it is for ops teams and developers to collaborate closely. We often operate in silos, but solving complex issues requires us to work together and share insights from different perspectives.

The Future

Looking back on this experience, I see that the tech landscape was shifting rapidly even then. SRE roles were becoming more prominent, and GitOps tools like ArgoCD and Flux were starting to mature. As we move forward, I believe these tools will play an increasingly important role in helping us manage Kubernetes clusters efficiently.

For now, though, it’s back to the basics: debugging is still about persistence, curiosity, and a willingness to explore every possible angle. And sometimes, that means getting your hands dirty with some low-level system tracing just to figure out what’s going on.


That was my day in tech on May 27, 2019—a reminder of the challenges we face as engineers and how much there is still to learn.