$ cat post/the-floppy-disk-spun-/-the-pipeline-hung-on-step-three-/-i-wrote-the-postmortem.md
the floppy disk spun / the pipeline hung on step three / I wrote the postmortem
Title: Kubernetes Complexity Fatigue: A Month of Debugging
December 14, 2020. I find myself at my desk in a remote corner of the world, trying to untangle a mess of pods and services that just don’t want to behave. The month has been filled with endless debates about whether we should use ArgoCD or Flux for GitOps; none of it is exactly groundbreaking tech, but it sure makes my hair grey faster.
Kubernetes Complexity Fatigue
I’ve been on this journey for years now, and I can confidently say that Kubernetes is a beast. It’s incredibly powerful, but oh so complex. The more you dig into it, the more you realize just how deep the rabbit hole goes. Today, I’m dealing with a pod that simply won’t restart after a failure. It’s a common issue, but this one feels particularly vexing.
The Setup
We’re using a cluster managed by EKS (Elastic Kubernetes Service) on AWS. Each node in our cluster has an eBPF program running to monitor network traffic and log it for us. This setup is fantastic for visibility, but it can also introduce points of failure that are harder to debug.
The pod I’m dealing with runs a simple microservice written in Node.js. It’s part of what used to be a monolith, since split into multiple smaller services, each responsible for a different part of our application stack. The service reads its configuration from environment variables and from secrets stored in AWS Secrets Manager, which adds another layer of complexity.
The Debugging Journey
I start by checking the pod logs to see if there are any error messages that might give me a hint. Nothing obvious jumps out at me, so I move on to inspecting the Kubernetes event history for clues. Here, I find something interesting: `0/1 nodes are available: 1 Insufficient memory.`
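For anyone playing along at home, the inspection went roughly like this. The pod and namespace names are made up for illustration, and the output is trimmed down to the interesting line:

```
$ kubectl logs orders-api-7d4f9b-x2k4q -n prod --previous   # logs from the crashed container
$ kubectl get events -n prod --sort-by=.lastTimestamp
LAST SEEN   TYPE      REASON             OBJECT                        MESSAGE
2m          Warning   FailedScheduling   pod/orders-api-7d4f9b-x2k4q   0/1 nodes are available: 1 Insufficient memory.
```

Sorting events by `.lastTimestamp` puts the freshest warnings at the bottom, which is usually where the answer is hiding.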
Wait, what? We have plenty of RAM in this cluster, and our pods should be configured with adequate resources. I cross-check the node statuses, and they all appear healthy.
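One thing worth remembering here: the scheduler’s `Insufficient memory` is about *requested* memory versus the node’s allocatable memory, not about what’s actually in use, so a node can look perfectly healthy on raw usage and still refuse new pods. `kubectl describe node` shows both sides of that ledger; the node name and numbers below are illustrative, not our real ones:

```
$ kubectl describe node ip-10-0-1-23.ec2.internal
...
Allocatable:
  memory:  15Gi
...
Allocated resources:
  Resource   Requests        Limits
  memory     14800Mi (96%)   16Gi (104%)
```

When `Requests` creeps up toward `Allocatable`, scheduling fails no matter how much RAM `free` says is left.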
The Scheduling Issue
After a few moments of scratching my head, it hits me: there’s another pod on one of the nodes that has been consuming more memory than expected. This pod runs the eBPF network-logging agent that, while useful, has been quietly gobbling up memory in the background.
I decide to manually evict the misbehaving pod from its node and see if that resolves the issue. Sure enough, once I do, the problematic node starts accepting new pods again. Hallelujah!
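For the record, the eviction itself is nothing fancy. A sketch of what it looks like, with a hypothetical agent pod name and illustrative numbers; deleting the pod is safe because its controller simply recreates it with a fresh memory footprint:

```
$ kubectl top pod -A --sort-by=memory | head -3    # confirm who the memory hog is
NAMESPACE    NAME                CPU(cores)   MEMORY(bytes)
monitoring   ebpf-logger-x7z2q   120m         3864Mi
$ kubectl delete pod ebpf-logger-x7z2q -n monitoring
```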
Lessons Learned
This experience has reinforced a few things for me:
- Resource Management: Even small services can consume significant resources, especially with eBPF running in the background. Monitoring and managing resource usage is crucial.
- Cluster Sizing: Our cluster sizing needs to take into account not just our primary workload but also any additional services or tools we might run on it.
- Pod Eviction: Having a reliable way to handle pod eviction can be a lifesaver when troubleshooting scheduling issues.
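Following the first lesson, the concrete fix on our side is to give the logging agent explicit requests and limits so the scheduler can plan around it. A minimal sketch, assuming the agent runs as a DaemonSet named `ebpf-logger` in a `monitoring` namespace; the name and the numbers are placeholders, not our real values:

```
$ kubectl set resources daemonset ebpf-logger -n monitoring \
    --requests=memory=256Mi --limits=memory=512Mi
```

With a request in place, the agent’s appetite is budgeted for up front instead of silently eating into what the scheduler thinks is free.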
The Broader Context
With this episode of Kubernetes complexity fatigue behind me, it feels like the industry is reaching a tipping point with all the new technologies and methodologies being introduced. From ArgoCD to Flux, eBPF to GitOps, it’s enough to make your head spin. But at the end of the day, it’s about finding the right balance between simplicity and power.
Final Thoughts
As I wrap up this debugging session, I can’t help but feel a sense of satisfaction. Debugging Kubernetes is like trying to solve a complex puzzle with pieces that keep changing. It’s challenging, but also incredibly rewarding when you finally see the light at the end of the tunnel. Here’s hoping for smoother sailing in the new year!
That’s it for today. Back to work!