$ cat post/the-old-datacenter-/-that-script-still-runs-somewhere-deep-/-the-secret-rotated.md
the old datacenter / that script still runs somewhere deep / the secret rotated
Title: Kubernetes Chaos: A Day in the Life of an Ops Engineer
April 24, 2017. Another Monday at my desk, staring at a screen full of red text and muttering in frustration. Kubernetes was winning the orchestration wars, but that victory wasn’t making my life any easier: I was debugging clusters more than I ever had before.
The Setup
We were running multiple Kubernetes clusters, each an island unto itself, mixing self-managed nodes with managed offerings like Google’s GKE. Every new deployment or update was a multi-step process of kubectl commands, Helm charts, and a lot of manual intervention.
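For flavor, here’s roughly what a single deploy looked like back then. A minimal sketch: the release name, chart path, and kube contexts are placeholders, not our actual setup.

```bash
# Render and apply the chart against the target cluster
# (names here are hypothetical; we ran one of these loops per cluster).
helm upgrade --install my-app ./charts/my-app \
  --kube-context prod-gke \
  --values values-prod.yaml

# Wait for the rollout to finish before touching anything else.
kubectl --context prod-gke rollout status deployment/my-app

# And then the "manual intervention" part: eyeballing the pods.
kubectl --context prod-gke get pods -l app=my-app
```

Multiply that by several clusters and a handful of services, and you can see where the manual toil came from.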
The Problem
Today, the issue at hand was related to our monitoring setup using Prometheus and Grafana. Our microservices were spiking with errors, and we needed to investigate why. But first, I had to get Kubernetes into a stable state.
A Day in Hell
8:00 AM - Debugging Services
I started by checking the status of our workloads across clusters with kubectl get pods. Halfway through, I noticed a problem in one of our production environments. It wasn’t a simple misconfiguration: a pod had been stuck in CrashLoopBackOff for hours.
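The triage loop for a crash-looping pod is short; a sketch below, with invented pod and namespace names.

```bash
# The stuck pod shows up in the STATUS column.
kubectl get pods -n payments
# NAME                        READY   STATUS             RESTARTS   AGE
# service-a-6c4f9d8b7-x2k9p   0/1     CrashLoopBackOff   47         3h

# Events and the last exit code explain what the kubelet keeps restarting.
kubectl describe pod service-a-6c4f9d8b7-x2k9p -n payments

# Logs from the previous (crashed) container instance.
kubectl logs service-a-6c4f9d8b7-x2k9p -n payments --previous
```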
After some digging, it turned out we had a startup race between two services: Service A was opening connections to Service B before B had finished initializing. As a temporary workaround, I added a readiness probe to Service B with a 10-second initial delay, so it wouldn’t receive traffic until it was actually ready. Not ideal, but it kept the system running until we could dig deeper.
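The workaround itself fit in one patch. A sketch, assuming Service B runs as a Deployment with a container named service-b exposing an HTTP health endpoint on port 8080; the names, path, and port are all assumptions, not our real manifest.

```bash
# Strategic-merge patch: containers merge on "name", so this adds a
# readinessProbe to the existing service-b container rather than replacing it.
kubectl patch deployment service-b -p '
spec:
  template:
    spec:
      containers:
      - name: service-b
        readinessProbe:
          httpGet:
            path: /healthz   # hypothetical health endpoint
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
'
```

Patching the live pod template triggers a rolling update, so the probe takes effect without a separate redeploy.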
9:30 AM - Cluster Migrations
While I was still on the service issue, a colleague mentioned that one of our clusters was going down for maintenance in an hour. That left a short window to migrate some of our critical services from the old cluster to the new one before we lost access.
I ran through the steps: backing up the Persistent Volumes (PVs), recreating them in the new cluster, and updating the deployment configurations to match. It was tedious, but Helm templates kept it smooth. I also double-checked that all the necessary annotations carried over so everything would behave the same on the new cluster.
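The volume half of that migration looked roughly like the sketch below. Disk, object, and context names are invented, and it assumes the PVs were backed by GCE persistent disks, which was true for us on GKE but won’t be everywhere.

```bash
# 1. Snapshot the disk behind each PV before touching anything.
gcloud compute disks snapshot my-app-data --snapshot-names my-app-pre-migration

# 2. Export the PV/PVC manifests with cluster-specific fields stripped
#    (kubectl of this era had --export for exactly that).
kubectl --context old-cluster get pv my-app-pv -o yaml --export > pv.yaml
kubectl --context old-cluster get pvc my-app-pvc -o yaml --export > pvc.yaml

# 3. Recreate the objects in the new cluster, then re-render the release there.
kubectl --context new-cluster create -f pv.yaml -f pvc.yaml
helm upgrade --install my-app ./charts/my-app --kube-context new-cluster
```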
10:30 AM - Prometheus Alerts
With the migrations mostly done, I shifted my focus back to monitoring. The Grafana dashboard still showed high error rates for one of our services, and when I dug into the Prometheus metrics, I found several pods failing, each with a different error.
I started debugging with kubectl logs -f on each pod (falling back to kubectl exec and tail -f for the services that still logged to files). It turned out a recent change to how we handled database connections was the culprit: we had updated our connection pool settings but hadn’t fully tested the change. I rolled the configuration back to its previous state and watched the error rates normalize.
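Both the narrowing-down query and the rollback were short. A sketch with invented metric, label, and release names; your instrumentation will export something different.

```bash
# Per-pod 5xx rate over the last five minutes, via the Prometheus HTTP API.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum by (pod) (rate(http_requests_total{service="my-service",status=~"5.."}[5m]))'

# The fix: roll the Helm release back to the revision before the
# connection-pool change landed (revision 12 is illustrative).
helm history my-service
helm rollback my-service 12
```

This assumes the pool settings lived in the chart’s values; if yours live in a ConfigMap outside the release, reverting that object is the equivalent move.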
The Aftermath
By noon, everything seemed stable again. I took a moment to reflect on how much Kubernetes had changed in just a few months. It was still an evolving platform with more moving parts than any of us initially anticipated. But the fact that we were able to recover and keep our services running despite these issues was encouraging.
Lessons Learned
- Automation is Key: Using tools like Helm for templating made the migration process less error-prone.
- Monitoring is Everything: Early detection through Prometheus helped isolate and fix issues before they became critical.
- Incremental Changes: We should have rolled out the database connection changes gradually, after testing them more thoroughly.
Looking Forward
With Kubernetes becoming more mainstream, I knew we needed to start thinking about platform engineering in a serious way. The next step would be to standardize our cluster configurations and create better automation around service deployments and migrations.
As the morning’s chaos wound down, I couldn’t help but feel both frustrated and excited about where this journey was taking us. Kubernetes might not always make life easy, but it certainly challenges me to think more deeply about how we build and operate complex systems.
That’s a wrap for today. Time for a cup of coffee and maybe some coding during my lunch break. Stay tuned for the next adventure in Kubernetes land!