$ cat post/kubernetes'-reign:-a-day-in-the-life-of-an-old-school-platform-engineer.md

Kubernetes' Reign: A Day in the Life of an Old School Platform Engineer


March 13, 2017. A day like any other on the Kubernetes front lines.

I woke up to a series of notifications from my beloved kube-state-metrics. I swear it’s like having an ex-girlfriend who still sends you texts in the middle of the night. “Is this deployment still stuck? Why is that node green and not yellow?” It’s a love-hate relationship, but hey, somebody has to keep track of all those pods.

My first order of business was to deal with a Kubernetes cluster outage in one of our staging environments. We had been pushing hard on getting our services ready for the big release next week. The cluster went down in the middle of the night, and the logs were just a mix of cryptic errors and my team’s frustration.

“Node ‘worker-3’ is stuck in an unschedulable state,” the logs screamed at me. That’s code for “something’s broken here.” I checked the node and found a disk issue. Simple enough, right? Wrong. The disk was failing, but the cluster wasn’t surfacing it as a node condition, so Kubernetes kept trying to schedule new pods onto a node that couldn’t actually run them, while the node itself sat there stuck.

I spent the better part of my morning writing a custom script that would watch for nodes that were stuck and alert us if they had been in an unschedulable state for more than 5 minutes. It wasn’t pretty, but it worked. I learned that while Kubernetes is incredibly powerful, it’s not perfect, especially when you’re dealing with complex disk issues.
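For the curious, here’s roughly the shape of that watcher. This is a minimal sketch using the official Python client, not the exact script we ran: the five-minute threshold matches the story, but the polling interval and the `alert` hook are stand-ins you’d replace with whatever actually pages your team.

```python
#!/usr/bin/env python3
"""Sketch of a watcher that flags nodes stuck in an unschedulable state.

Assumes the official `kubernetes` Python client and a working kubeconfig
(or in-cluster config). The alert hook below is a hypothetical stub.
"""
import time

from kubernetes import client, config

STUCK_AFTER_SECONDS = 5 * 60   # alert once a node has been unschedulable this long
POLL_INTERVAL = 30             # seconds between checks


def alert(node_name: str, stuck_for: float) -> None:
    # Stand-in for the real notification path (Slack, PagerDuty, email, ...).
    print(f"ALERT: node {node_name} unschedulable for {int(stuck_for)}s")


def main() -> None:
    config.load_kube_config()      # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    first_seen = {}                # node name -> timestamp we first saw it unschedulable

    while True:
        for node in v1.list_node().items:
            name = node.metadata.name
            if node.spec.unschedulable:
                first_seen.setdefault(name, time.time())
                stuck_for = time.time() - first_seen[name]
                if stuck_for > STUCK_AFTER_SECONDS:
                    alert(name, stuck_for)
            else:
                # Node recovered (or was uncordoned); reset its timer.
                first_seen.pop(name, None)
        time.sleep(POLL_INTERVAL)


if __name__ == "__main__":
    main()
```

Nothing clever: poll the node list, remember when each node first showed up as unschedulable, and yell once it’s been that way too long. Like I said, not pretty, but it worked.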

Later, my team had a meeting to discuss the upcoming release. We were going through the checklist: are all the services up and running? Are there any edge cases we need to handle? The conversation turned to Helm. Some folks thought that using Helm would make our lives easier by abstracting away some of the Kubernetes complexity. Others argued it was unnecessary bloat, especially since most of our apps were stateless.

I chimed in with a compromise: “Let’s use Helm for new services but stick with plain manifests for our legacy applications. We can still benefit from version control and templating without adding extra layers.” It wasn’t the perfect solution, but it was a pragmatic one that everyone could agree on.

As I was wrapping up my day, I found myself thinking about the term “serverless” again. It’s been all over Hacker News this month, with people debating its merits. But here’s the thing: serverless is mostly a new label for managed services like AWS Lambda and Google Cloud Functions that abstract away much of the infrastructure. It can be useful in certain scenarios, but I still find myself preferring a good old-fashioned Kubernetes deployment when I need full control.

The world of DevOps and platform engineering seems to shift every day, but one thing remains constant: there’s always something new to debug or optimize. In today’s era of Kubernetes and serverless, we have more tools than ever before, but the underlying principles of good ops—monitoring, logging, automation—aren’t changing.

As I close my laptop for the night, I’m reflecting on how far we’ve come with Kubernetes. It’s a powerful tool that has won the container wars, and it’s only getting better. But as always, there’s still room for improvement, whether that’s better observability tools or more intuitive ways to manage stateful applications.

For now, I’ll just enjoy the quiet of an empty office and a few beers before tackling another day in the Kubernetes trenches.


That’s my take on March 13, 2017. Hope it gives you some insight into how we were thinking about DevOps and platform engineering back then.