$ cat post/telnet-to-nowhere-/-i-git-bisect-to-old-code-/-the-container-exited.md
telnet to nowhere / I git bisect to old code / the container exited
Title: Kubernetes Complexity Fatigue: A Personal Take
It’s October 14th, 2019, and the world is a strange mix of digital calm and real-world chaos. The tech industry hums along quietly: GitOps tools like ArgoCD and Flux are maturing, and eBPF keeps gaining traction in the background. Kubernetes, which has been my primary focus for the past few years, now feels like it’s causing more stress than it’s worth.
The Bumpy Ride
A couple of weeks ago, I faced one of those “it was supposed to work” moments that are a staple of any infrastructure engineer’s day-to-day. We had recently upgraded our cluster to Kubernetes 1.15 and, as you might imagine, things didn’t go exactly as planned.
The initial rollout went smoothly; hundreds of pods came up without a hitch. But then it started. The logs began filling with liveness probe failures for a few key services. At first, I thought maybe we had some misconfigurations or resource issues, but as the day wore on, more and more pods started failing their liveness probes.
Digging Into the Issue
The first thing to do was gather data: metrics from Prometheus, logs from Fluentd, and a thorough audit of our deployment manifests. After an hour or so of sifting through the chaos, I found the culprit: during the upgrade, the livenessProbe.initialDelaySeconds value on one of our critical services had inadvertently been slashed, so the kubelet was probing pods before they had finished their startup work.
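Most of that hunting happened in the events stream rather than in the application logs. A rough sketch of the commands I leaned on, with a placeholder namespace and pod name rather than our real ones:

```bash
# List recent probe failures (reason=Unhealthy) in a namespace, newest last,
# to see which pods the kubelet is flagging.
kubectl get events -n payments \
  --field-selector reason=Unhealthy \
  --sort-by=.lastTimestamp

# Drill into one flagged pod; the Events section at the bottom shows which
# probe failed and what the kubelet did about it.
kubectl describe pod payments-api-5d9c7b6f4-abcde -n payments
```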
The default initialDelaySeconds is 0, so a service that needs real warm-up time depends entirely on that value being set correctly in the manifest. Cut it too low on a deployment that is sensitive to startup time, or that needs some initial setup before being fully functional, and the kubelet starts failing probes against containers that are perfectly healthy, just slow to come up. A simple misconfiguration can cause all sorts of issues.
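For context, this is roughly the shape of the stanza in question. The manifest below is a hypothetical sketch, not our actual service, but it shows where that one number lives and how little it takes to get it wrong:

```yaml
# Illustrative Deployment; names, image, and values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.4.2
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30   # too low here means restarts before the app finishes warming up
            periodSeconds: 10
            failureThreshold: 3
```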
The Fix
I rolled the manifest back, restoring the original probe delay, and redeployed. Almost immediately, everything started coming up as expected. It was a humbling reminder of how much can go wrong with even small configuration changes in Kubernetes.
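The mechanics of the fix were deliberately boring. Roughly, with the same placeholder names as above:

```bash
# Either revert to the previous pod template...
kubectl rollout undo deployment/payments-api -n payments

# ...or re-apply the corrected manifest and watch it converge.
kubectl apply -f payments-api.yaml
kubectl rollout status deployment/payments-api -n payments
```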
This incident led me to reflect on the broader challenges we face in managing Kubernetes clusters at scale. The complexity has grown significantly over the years, from simple deployments to managing stateful applications, networking, and security policies. And while tools like ArgoCD are making some of these tasks easier, they don’t necessarily alleviate the need for deep understanding.
eBPF: A Silver Lining?
One bright spot in this sea of complexity is eBPF (Extended Berkeley Packet Filter). It’s been gaining attention as a powerful tool for low-level networking and tracing. Imagine being able to run small, sandboxed programs inside the kernel without recompiling it or loading a module; that’s the magic of eBPF.
While I haven’t yet dived deep into using eBPF, the thought of having such fine-grained control over system events is exhilarating. It could be a game-changer for performance tuning and debugging in Kubernetes clusters.
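To give a sense of what that looks like in practice, here’s the kind of one-liner people run with bpftrace, which compiles short scripts down to eBPF programs. Assuming bpftrace is installed on the node, this traces every execve on the host, container workloads included:

```bash
# Trace process execution node-wide: the calling command and the binary it execs.
# Useful for seeing exactly what a container runs while it starts up.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'
```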
Looking Forward
As we continue to scale our infrastructure, I find myself increasingly interested in how we can leverage tools like eBPF to simplify some of the more complex tasks. SRE roles are proliferating, which is great because they bring a valuable perspective on reliability and observability. Internal developer portals like Backstage are also becoming essential for managing application lifecycles and infrastructure.
But even with all these advancements, Kubernetes remains a beast that requires constant vigilance. The complexity fatigue isn’t going away anytime soon, but perhaps tools like eBPF will help make it more manageable.
For now, I’ll focus on learning how to use eBPF effectively and keep an eye out for any other emerging technologies that could simplify our lives.
Until next time,
Brandon