$ cat post/the-daemon-restarted-/-the-config-file-knows-the-past-/-i-wrote-the-postmortem.md

the daemon restarted / the config file knows the past / I wrote the postmortem


Title: Container Chaos: Debugging Kubernetes in Real Life


April 4, 2016. It’s a Monday and I’m knee-deep in Kubernetes, the container orchestration tool that’s been all the rage lately. We’ve just finished migrating our production services to this new system, and while it seems like everyone is singing its praises, there’s still an air of uncertainty. Today I found myself staring at a cryptic error message, trying to figure out why one of our pods kept crashing.

The environment was set up using Kubernetes, with the usual suspects: Helm for package management, Envoy for service mesh, and Prometheus + Grafana for monitoring. We were also experimenting with serverless, but that’s on hold due to some lingering doubts about its maturity in production environments.

I had a nagging feeling as I stepped through the logs. Something just didn’t feel right. The error message pointed to an issue in one of our microservices, which we’ll call “Frobulator.” Frobulator was a critical piece of our stack that handled data processing and storage for our main application. It crashed periodically, but only on Kubernetes, not when run locally.

After hours of digging through code and logs, I finally found the culprit: a subtle race between our file writes and pod termination. The Frobulator service wrote intermediate data to disk during its operations, and when Kubernetes restarted or rescheduled the pod, those writes sometimes didn’t complete before the process was killed.
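
To make the failure mode concrete: an intermediate file that is only half-written when the pod dies looks like a valid but truncated file to whoever reads it next. A common way to guard against that, sketched here in Go purely for illustration (this is not Frobulator’s actual code, and every name in it is made up), is to write to a temporary file in the same directory, fsync it, and then rename it over the target, so a reader only ever sees the old file or the complete new one:

```go
// atomic_write.go: illustration only; names and paths are invented.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the same directory,
// fsyncs it, then renames it over the target path. A reader (or a
// restarted pod) sees either the old file or the complete new one,
// never a half-written intermediate.
func writeFileAtomic(path string, data []byte) error {
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless if the rename already succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path) // atomic on the same filesystem
}

func main() {
	if err := writeFileAtomic("/tmp/frobulator-intermediate.dat", []byte("payload")); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
	}
}
```

The rename is what buys the atomicity, and it only holds if the temp file and the destination live on the same filesystem.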

It turned out that Kubernetes was killing the container before it could finish writing to the filesystem, which left partially written intermediate files behind and, in turn, caused problems in our distributed cache. The issue wasn’t obvious at first because everything worked fine on local machines or in standalone Docker containers, where nothing was rescheduling the process and killing it out from under us.

I filed a bug report and posted my findings on the Kubernetes GitHub repo, hoping that others might have run into similar issues. I also reached out to some colleagues who had more experience with Kubernetes, and we hashed out a few potential solutions. One idea was to use an in-memory cache instead of writing intermediate data to disk. Another was to add retries for critical writes.
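
The retry idea is easy enough to sketch. Again in Go, again illustration only, with made-up names and arbitrary backoff values; the point is simply to wrap a critical write in bounded retries with backoff instead of giving up on the first error:

```go
// retry_write.go: sketch of the "retries for critical writes" idea.
package main

import (
	"fmt"
	"os"
	"time"
)

// withRetry runs fn up to attempts times, sleeping with simple
// exponential backoff between failures, and returns the last error.
func withRetry(attempts int, initial time.Duration, fn func() error) error {
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	err := withRetry(3, 100*time.Millisecond, func() error {
		// Stand-in for whatever write must not be lost.
		return os.WriteFile("/tmp/frobulator-critical.dat", []byte("payload"), 0o644)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Retries only cover transient failures, of course; a write that gets killed halfway through still has to be safe to interrupt in the first place.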

Implementing these changes wasn’t trivial. We had to refactor parts of our Frobulator service, which meant extensive testing across various environments. But eventually, we managed to get it working reliably on Kubernetes. It felt like a victory—after all the debugging and refactoring, the pod was no longer crashing in production.

This experience reinforced my belief that while tools like Kubernetes are powerful, they also introduce new failure modes. Debugging issues like this one can be harder than on traditional setups because the orchestrator can reschedule, restart, or kill your process at any moment, and none of that shows up when you run the code on your laptop. We learned to be meticulous about making sure our code could survive unexpected termination.
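
Concretely, handling unexpected termination on Kubernetes mostly means catching SIGTERM: the kubelet sends SIGTERM first, waits out terminationGracePeriodSeconds (30 seconds by default), and only then sends SIGKILL. Here’s a minimal sketch of that shutdown path, with flushPending as a made-up stand-in for whatever a real service would drain and sync:

```go
// shutdown.go: sketch of catching SIGTERM so in-flight work can finish
// before the grace period expires. flushPending is a placeholder.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func flushPending() {
	// Placeholder: a real service would drain queues and sync files here.
	time.Sleep(500 * time.Millisecond)
	fmt.Println("pending writes flushed")
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	fmt.Println("frobulator running; waiting for SIGTERM")
	<-sigs // Kubernetes sends SIGTERM, then SIGKILL after the grace period

	// Whatever happens here must fit inside terminationGracePeriodSeconds,
	// or the kubelet will SIGKILL the process anyway.
	flushPending()
}
```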

As I sit here now, the Frobulator service runs smoothly on Kubernetes, a testament to the hard work and persistence required when dealing with these modern infrastructure tools. Looking around, I see that similar issues are being discussed in various Slack channels and GitHub discussions. It’s clear that platform engineering conversations are starting to gain traction, as people struggle with the intricacies of running services at scale.

The tech landscape is evolving so rapidly, but there’s a sense of camaraderie among those navigating these new waters. The Panama Papers leak may have been making headlines on Hacker News, but for me, today was about figuring out why a piece of code wasn’t behaving as expected in a Kubernetes environment. In the end, it’s these small victories that make the journey worthwhile.