

Title: Kubernetes Complexity Fatigue and the Art of Keeping It Simple


December 23, 2019. The air is crisp outside, but in my office it’s a bit warmer as I sit down to write this blog post. The holiday season feels like a distant echo as we’re deep into our year-end sprint. Kubernetes has become the de facto standard for container orchestration, and with that comes an inevitable complexity fatigue.

I’ve been working closely with Argo CD and Flux for GitOps for months now. They’re powerful tools, but they also add layers of abstraction that can sometimes obscure the underlying architecture. Today, I found myself wrestling with a problem that was, in many ways, quite simple, yet frustratingly complex.

The Problem

We were seeing sporadic issues with our Kubernetes cluster’s state reconciliation process. Pods would get stuck in CrashLoopBackOff, restarting over and over, and we couldn’t determine the root cause. After days of digging through logs and monitoring metrics, I found that the issue was related to how we managed sidecar containers for logging.

The Root Cause

The problem stemmed from overzealous livenessProbe and readinessProbe configurations in our deployment manifests. We had inadvertently made these probes far too aggressive: short initial delays, tight timeouts, and low failure thresholds meant that ordinary, momentary slowdowns were treated as failures. The constant probing and the restarts it triggered put excessive load on the sidecar container, which was supposed to handle logging for a stateless application.
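For illustration, the aggressive configuration looked something like the sketch below. The container names, image, port, and exact values here are hypothetical, not our actual manifest, but they capture the shape of the problem:

```yaml
# Hypothetical sketch of an overly aggressive probe setup.
# Names, image, and values are illustrative only.
containers:
- name: app
  image: example/web-app:1.0       # placeholder image
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 1         # probing almost immediately after start
    periodSeconds: 2               # probing every two seconds
    timeoutSeconds: 1              # any slow response counts as a failure
    failureThreshold: 1            # a single miss triggers a container restart
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    initialDelaySeconds: 1
    periodSeconds: 2
    timeoutSeconds: 1
    failureThreshold: 1
- name: log-sidecar                # hypothetical logging sidecar
  image: example/log-shipper:1.0
```

With failureThreshold at 1 and a one-second timeout, a single slow health-check response is enough for the kubelet to restart the container, which is exactly the kind of churn that showed up as CrashLoopBackOff.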

The Fix

The solution was simple: relax the probe intervals and thresholds. I made some adjustments in our deployment YAMLs, raising initialDelaySeconds, increasing periodSeconds so probes ran less often, and allowing several consecutive failures before a restart. It took just a few minutes of work, but getting there required a deep dive into how these probes interact with sidecar containers.
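The adjusted probe looked roughly like this. Again, the specific numbers, path, and port are illustrative rather than our exact values:

```yaml
# Hypothetical sketch of the relaxed probe configuration.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app and its logging sidecar time to start
  periodSeconds: 15         # probe far less frequently
  timeoutSeconds: 5         # tolerate occasional slow responses
  failureThreshold: 3       # require several consecutive failures before a restart
```

The key design choice is that restarts should only happen on sustained failure: three misses fifteen seconds apart means the app has genuinely been unhealthy for a while, not that it hiccuped once.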

The Learning

This experience brought me back to the basics of Kubernetes configuration. It’s easy to get lost in the details of GitOps workflows and advanced monitoring tools, but sometimes the simplest solutions are right under our noses. We need to keep reminding ourselves that Kubernetes is a tool, not an end-all-be-all solution.

The Broader Implications

This little episode of complexity fatigue has broader implications for our team and beyond. As we continue to scale operations in remote-first environments, maintaining simplicity becomes crucial. Every additional layer of abstraction or configuration adds potential points of failure and reduces overall maintainability.

As platform engineers, it’s our job to strike the right balance between innovation and practicality. We must stay vigilant against complexity creeping into our systems and remember that often, the simplest solutions are the best ones.

The Takeaway

As we look back at 2019, I’m reminded of the importance of keeping things simple. Whether it’s through GitOps practices or Kubernetes configurations, simplicity in design can lead to fewer headaches down the line. Let’s resolve to keep our systems lean and clean, even as we embrace the latest tools and technologies.

Happy holidays, everyone! May 2020 bring us clear skies and simpler challenges.


Signing off with the sense of peace that comes from resolving an issue that was small in scale but significant in its impact on our system.