Kubernetes and Me: A Year in Review
July 23, 2018, feels like a milestone. The container wars are won, and everyone's moving to Kubernetes. Helm is gaining traction, Istio is emerging from the shadows, and serverless still seems more buzz than reality. I'm deep into my third year as an engineering manager, and platform engineering is becoming the hot new thing in ops circles.
The Day It All Hit Home
I spent most of this month debugging a massive production outage in our key application. It was the kind of incident that makes you feel like you've hit rock bottom. The app went down hard, taking half of our company's core services with it. After hours of frantic digging and countless "why isn't this working?" moments, I finally pieced together the cause: a Kubernetes misconfiguration.
The Misadventure
I had been tweaking pod labels to better manage resources and deployments. In my eagerness to optimize resource usage, I made the changes without fully testing their implications across all environments. As is often the case, production differed from staging in ways I hadn't accounted for. A side effect of the misconfiguration was that our application pods started being scheduled onto nodes that were never meant to handle certain types of traffic.
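One way a relabeling change like this can go wrong, sketched with entirely hypothetical names, labels, and images (this is not our actual manifest): if a scheduling constraint gets dropped while the labels are reworked, the scheduler no longer pins the pods to the nodes that were sized for them.

```yaml
# Illustrative sketch only: every name and label here is made up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: core-api
  template:
    metadata:
      labels:
        app: core-api
    spec:
      # If a constraint like this is lost while relabeling, the
      # scheduler is free to place pods on any node in the cluster,
      # including nodes never meant to take this kind of traffic.
      nodeSelector:
        workload-class: high-memory
      containers:
        - name: api
          image: example/core-api:1.4.2
```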
When load spiked unexpectedly, those pods went into an unhealthy state and brought down the entire service. Losing a service in production was painful enough, but the real kicker came when I had to explain what happened to teammates who hadn't touched the change but whose work depended on the outcome.
The Lessons Learned
- Test Everything: This isn’t just about regression testing or integration tests; it’s about thoroughly understanding how changes will affect the entire system.
- Document Your Assumptions: When you make a change, document what you expect to happen and what might go wrong. It’s like writing an epic tale before setting out on your quest.
- Review Changes in Stages: Don't jump straight to production. I should have had a better plan for staged rollouts and a rollback strategy.
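That last lesson can be made concrete in the Deployment spec itself. A conservative rolling-update strategy (the values and the health endpoint below are illustrative, not a prescription) only retires pods as their replacements pass a readiness probe:

```yaml
# Sketch of a cautious rollout configuration; all values illustrative.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove a healthy pod before its
      maxSurge: 1         # replacement is up and passing its probe
  template:
    spec:
      containers:
        - name: api
          readinessProbe:       # gate traffic on actual health
            httpGet:
              path: /healthz    # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

With `maxUnavailable: 0`, a bad image tends to stall the rollout rather than eat into serving capacity, and `kubectl rollout undo` gives you a fast path back to the previous revision.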
Moving Forward
While the outage was painful, it spurred me to rethink our deployment practices. We started integrating more robust pre-deployment checks and added automated testing scripts. I also pushed for more detailed change documentation and review processes.
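To give a flavor of what those pre-deployment checks can look like: here is a small, hypothetical lint script (not a real tool we ran; the field paths follow standard Kubernetes manifest structure) that fails a pipeline if any Deployment manifest has lost its `nodeSelector`, which is exactly the class of mistake that bit us.

```python
"""Sketch of a pre-deployment check: flag Deployment manifests that
lack a pod-level nodeSelector. Policy and names are illustrative."""


def missing_node_selector(manifest: dict) -> bool:
    """Return True if a Deployment manifest has no pod-level nodeSelector."""
    if manifest.get("kind") != "Deployment":
        return False  # this sketch only polices Deployments
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    return not pod_spec.get("nodeSelector")


def check_manifests(manifests: list[dict]) -> list[str]:
    """Names of Deployments that could schedule onto arbitrary nodes."""
    return [
        m.get("metadata", {}).get("name", "<unnamed>")
        for m in manifests
        if missing_node_selector(m)
    ]
```

Wired into CI, a non-empty result from `check_manifests` blocks the deploy and forces a human to confirm the scheduling change was intentional.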
Kubernetes is a powerful tool, but like any other technology, it requires careful handling. The shift from monolithic applications to microservices has made ops a lot more interesting, but not without its challenges. It’s easy to get caught up in the hype of new tools and technologies while forgetting that the fundamentals still matter—like good old-fashioned testing.
The Future
As we move further into 2018, I’m excited about where platform engineering is headed. GitOps is starting to gain some traction, and tools like Helm and Prometheus are making it easier to manage and monitor our infrastructure. But at the same time, I’m wary of over-relying on any single solution.
Kubernetes will continue to evolve, and with that comes new challenges and opportunities. I plan to keep a close eye on developments in this space—especially around observability and resiliency.
That’s my take for today. The tech landscape is always shifting, but the lessons we learn from our mistakes are what make us better engineers. Hope you found it useful!