$ cat post/the-swap-filled-at-last-/-i-still-remember-that-ip-/-it-boots-from-the-past.md

the swap filled at last / I still remember that IP / it boots from the past


Debugging a Kubernetes Cluster with etcd


June 16, 2014. A day like any other in the world of container orchestration and microservices. I woke up to the familiar hum of early morning emails and Slack notifications, but this time there was a peculiar one from our DevOps team. They had encountered an issue with our Kubernetes cluster that required my expertise.

The problem seemed straightforward at first glance: nodes were failing to communicate properly, causing services to go down. The symptoms pointed towards etcd, the distributed key-value store used by Kubernetes for maintaining the shared state of a cluster.

etcd was still quite new in 2014, and its complexity was starting to show. We had only a handful of nodes at the time, but debugging this kind of issue could be as frustrating as trying to solve a Rubik’s Cube with one eye closed. My first step was to pull the logs from etcd, which are notoriously verbose but usually the only place to start when diagnosing issues like this.
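For what it’s worth, “get the logs” looked roughly like the sketch below. It assumes etcd ran under systemd on each member; the host names, the unit name, and the output directory are all made up for illustration.

```bash
# Minimal sketch: pull the last 24 hours of etcd logs from each member
# for offline inspection. Host names and the systemd unit name are assumptions.
mkdir -p logs
for host in etcd-01 etcd-02 etcd-03; do
  ssh "$host" 'journalctl -u etcd --since "24 hours ago" --no-pager' \
    > "logs/${host}-etcd.log"
done
wc -l logs/*-etcd.log   # a first feel for just how verbose "verbose" is
```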

As I navigated through the logs, I noticed a pattern: repeated warnings about TTLs expiring and, more worryingly, constant leadership changes. etcd elects a single leader among its members via Raft, and every write has to go through that leader; if members keep missing heartbeats and triggering new elections, the cluster spends more time voting than serving Kubernetes. Clearly, that was what was happening to us.
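You could see the churn from the outside without reading a single log line, just by asking each member who it thought the leader was. A rough sketch using etcd’s v2 stats API (4001 was the default client port on the versions we were running; hosts and port here are illustrative):

```bash
# Ask each member for its own state and who it believes the leader is.
# On a healthy cluster every member agrees and the leader's uptime keeps growing;
# during the incident the answers flapped.
for host in etcd-01 etcd-02 etcd-03; do
  echo "== $host =="
  curl -s "http://$host:4001/v2/stats/self" \
    | grep -o -E '"state":"[^"]*"|"leader":"[^"]*"|"uptime":"[^"]*"'
done
```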

I dove deeper into the logs, trying to pinpoint exactly when these issues had started. That’s when I found it: a sudden spike in network traffic around 2 AM the previous night. It was like a bad episode of “House”: finding the root cause meant stepping back and looking at all the available information at once.
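Mechanically, that narrowing-down was nothing fancier than grepping the merged logs for the window around the spike; a rough sketch (the timestamp format here is journalctl’s default and will differ if your logs land somewhere else):

```bash
# How many elections fired in the hour around the spike?
grep -h 'Jun 16 02:' logs/*-etcd.log | grep -c -i 'election'

# And the surrounding leader/election lines, for context.
grep -h 'Jun 16 02:' logs/*-etcd.log | grep -i -E 'election|leader' | head -n 20
```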

After hours of digging, I realized that our etcd cluster wasn’t properly configured for high availability. In the rush to get Kubernetes up and running, we had never really tested the HA setup: an odd number of members to hold quorum, heartbeat and election timeouts that matched our network, and enough network redundancy between the members themselves.
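For reference, the shape of a proper bootstrap is sketched below. Names, IPs, and the token are made up, and the flag spellings follow etcd’s later static-bootstrap interface; the 0.4-era flags we were actually running were spelled differently, so treat this as a sketch rather than what we typed that day.

```bash
# One member of a three-member static bootstrap (repeat with the other names/IPs).
# Everything here is illustrative: names, addresses, and the cluster token.
etcd --name etcd-01 \
  --initial-advertise-peer-urls http://10.0.0.11:2380 \
  --listen-peer-urls http://10.0.0.11:2380 \
  --listen-client-urls http://10.0.0.11:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://10.0.0.11:2379 \
  --initial-cluster-token k8s-etcd \
  --initial-cluster 'etcd-01=http://10.0.0.11:2380,etcd-02=http://10.0.0.12:2380,etcd-03=http://10.0.0.13:2380' \
  --initial-cluster-state new
```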

The solution? Simple but painful: reconfigure etcd with appropriate parameters, test thoroughly in staging, and then roll out the changes. I knew this was going to be an all-hands effort because every node would need to be updated and redeployed carefully.
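The rollout itself was mostly a loop of cycling one member at a time and refusing to move on until the whole cluster reported healthy again. A rough sketch of the check we kept rerunning (hosts are made up, and the exact health endpoint and client port depend on the etcd version):

```bash
# Verify every member before touching the next one; only proceed when all are healthy.
for host in etcd-01 etcd-02 etcd-03; do
  printf '%s: ' "$host"
  curl -s --max-time 2 "http://$host:4001/health" || printf 'unreachable'
  echo
done
```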

As I sat there typing commands into my terminal, I couldn’t help but feel a mix of pride and frustration. Pride for working through such issues and frustration at how easy it is to overlook critical aspects during the rush to adopt new technologies. Kubernetes was still bleeding edge, and with great power came great responsibility.

The fix wasn’t just about etcd; it was about recognizing our own shortcomings in setup and configuration. It highlighted the importance of a robust infrastructure foundation, no matter how cutting-edge the tools might be.

Looking back at that day, it feels like a microcosm of what’s always true: technology is only as good as the people implementing it. We needed to be better prepared for the unexpected, more thorough in our tests, and more communicative about potential pitfalls.

That’s why I kept working on it, late into the night, until the cluster was healthy again. And even now, whenever I think of Kubernetes or etcd, that day comes back to me: a reminder to always question what seems straightforward and never to underestimate the complexity beneath the surface.


This post recounts a specific experience with Kubernetes and etcd in 2014, grounded in real work rather than a summary of external news.