$ cat post/the-kernel-panicked-/-i-traced-it-to-the-library-/-i-kept-the-old-box.md

the kernel panicked / I traced it to the library / I kept the old box


Title: Kubernetes, Helm, and the Day I Thought I Knew It All


March 26, 2018. A day that will go down in my tech calendar as “the day I learned a lot about Kubernetes.” It’s funny how quickly you can get complacent when everything seems to be working smoothly.

Let me set the scene. We’re running a mix of services on Kubernetes clusters, using Helm for application deployments, with a bit of manually applied Terraform sprinkled in. Prometheus is our monitoring backbone, and Grafana provides the pretty dashboards. Sounds familiar, right?

The Setup

We have a service called log-collector, which was originally deployed as part of our logging pipeline. It was running happily on its own cluster, and we had a basic Helm chart to deploy it every time someone needed another instance. However, a few weeks ago, I had this brilliant idea: why not run all of these components together in one cluster? We could save some costs, make things more efficient, and maybe even reduce our ops overhead.

So, I went ahead and combined the log-collector with other services into a single namespace. I spent an afternoon tweaking the Helm chart to support multiple instances, and then merged it into master. “Problem solved,” I thought smugly as I pushed the changes.
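The multi-instance support boiled down to parameterizing the chart. A minimal sketch of what that values file might look like (the instance names and keys here are illustrative, not our actual chart):

```yaml
# Hypothetical values.yaml for a log-collector chart,
# parameterized so several instances can share one namespace.
instances:
  - name: log-collector-app      # collects application logs
    logSource: app-logs
    replicas: 2
  - name: log-collector-infra    # collects infrastructure logs
    logSource: infra-logs
    replicas: 1
```

The chart then loops over `instances` to render one Deployment per entry, which is exactly the kind of change that looks harmless in review.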

The Fiasco

Fast forward to March 26th. Just as I’m about to join the team’s standup call, my pager starts going off. It’s a critical alert from our monitoring system: log-collector is crashing left and right! My first instinct was that it must be some flaky application code. But then I noticed something strange: the crashes were clustered around the same time every day.

I dove into the logs and saw this:

2018-03-26T15:47:02Z E | failed to reconcile: error when creating "": admission webhook "validate.log-collector.kube-system.svc/failure" denied the request: ...

The failure webhook was the culprit. I hadn’t reconfigured it for the consolidated cluster, so every time a change happened in the namespace, the webhook rejected it.

I tried to debug this on my own, but as usual with Kubernetes, things are rarely as simple as they seem. The webhook’s own logs were vague, and I couldn’t pinpoint what was causing the rejections. I spent the next hour trying different configurations, but nothing worked.

The Lesson

After a frustrating day of debugging, I finally called in my colleague who’s a Kubernetes ninja. As we sat together looking at the code, he pointed out that the webhook configuration had an invalid path parameter. I felt like a complete idiot for missing it—Kubernetes still has its quirks, and sometimes you just don’t catch everything.
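For anyone who hasn’t hit this: the path in question lives under `clientConfig.service` in the webhook registration. A sketch of the relevant fragment, assuming a setup like ours circa 2018 (service name and paths are illustrative; only the webhook name comes from the error message):

```yaml
# Sketch of a ValidatingWebhookConfiguration fragment.
# v1beta1 was the admission registration API version in early 2018.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: validate-log-collector
webhooks:
  - name: validate.log-collector.kube-system.svc
    clientConfig:
      service:
        name: log-collector-webhook   # hypothetical service backing the webhook
        namespace: kube-system
        path: /validate               # must be a valid URL path the service actually serves;
                                      # an invalid value here denies every matching request
    failurePolicy: Ignore             # with "Fail", any webhook error becomes a denied request
```

Note the `failurePolicy` line: with `Fail`, a broken path doesn’t just disable validation, it blocks every change the webhook matches, which is exactly the behavior we saw.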

This experience taught me several things:

  1. Complacency is the enemy: Just because something worked before doesn’t mean it will always work in new contexts.
  2. Documentation and testing are key: I should have spent more time writing documentation for the Helm chart and actually testing the webhook configuration.
  3. Don’t assume you know everything: Even when things seem simple, there’s often a deeper layer of complexity that can trip you up.

Moving Forward

Since then, we’ve updated the Helm chart with better validation of its configuration and added tests that exercise the webhook setup. We also plan to add more specific alerts so this class of failure pages us with a clearer signal.
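On the alerting side, one concrete step is a rule that fires on the restart pattern we saw, rather than a generic “something is unhealthy” page. A hedged sketch in the 2018-era Prometheus rules format, assuming kube-state-metrics is scraped (the group name and thresholds are illustrative):

```yaml
# Hypothetical Prometheus alerting rule for repeated log-collector restarts.
groups:
  - name: log-collector.rules
    rules:
      - alert: LogCollectorCrashLooping
        # kube-state-metrics exposes per-container restart counts;
        # a sustained nonzero rate means the pod is crash-looping.
        expr: rate(kube_pod_container_status_restarts_total{container="log-collector"}[15m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "log-collector is restarting repeatedly"
```

A rule like this would have told us “crash loop” rather than “critical alert,” which is a small but real head start on a bad morning.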

This experience is a great reminder that while Kubernetes is incredibly powerful, it still requires careful attention and thorough understanding. The tech landscape may change rapidly, but the principles of good engineering and problem-solving remain constant.

So, as I continue with my day, I’ll keep those lessons close to heart—learning from mistakes and staying humble in the face of complex problems.