Kubernetes Growing Pains: A Real-Time Case Study
October 10, 2016 was a busy day at our company. We had just completed an intense week-long migration of multiple microservices to Kubernetes, and we were still feeling the aftereffects. The entire engineering team was buzzing with a mix of relief and exhaustion. It was one of those days when you realize that while Kubernetes is a game-changer, it’s not without its growing pains.
We launched our initial Kubernetes cluster about two weeks ago, and the excitement was palpable. Our goal was to offload some of our container orchestration duties from Mesos onto Kubernetes for better scalability and easier management. We had planned for this migration like any other major change—researching best practices, creating a detailed migration plan, and setting up a staging environment.
The Setup
We had a relatively straightforward setup:
- Three master nodes with HAProxy for load balancing.
- A few worker nodes, each running Docker containers.
- A cluster of storage nodes using Ceph for persistent storage.
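Load balancing the three masters with HAProxy amounts to a TCP frontend sitting in front of the API servers. A minimal sketch of that configuration, with hypothetical hostnames, IPs, and the standard API server port, not our actual topology:

```
# TCP passthrough to the three Kubernetes API servers
frontend k8s-apiserver
    bind *:6443
    mode tcp
    default_backend k8s-masters

backend k8s-masters
    mode tcp
    balance roundrobin
    server master1 10.0.0.11:6443 check
    server master2 10.0.0.12:6443 check
    server master3 10.0.0.13:6443 check
```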
On the day of the migration, everything seemed to be going smoothly. We moved our first service over and it worked like a charm. By lunchtime, we had three services up and running without any issues. But as the afternoon progressed, the bugs started creeping in.
The Bugs
One of the first things we noticed was that one of our critical services kept failing to restart after a crash. We couldn’t figure out why it was behaving differently from our staging environment. After several debugging sessions and a round of log analysis, we realized that Kubernetes was launching the container with a different default command than Mesos had.
In Mesos, the default command was exec /entrypoint; in Kubernetes it was just /entrypoint, which tripped up certain entrypoints. We had to add an extra layer of configuration to make sure our services behaved as expected. It was a bit of a headache, but we got around it.
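That extra layer of configuration boiled down to setting the container’s command explicitly, so a wrapping shell exec’s the entrypoint the way Mesos did. A minimal sketch of such a pod manifest; the names and image are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-service
spec:
  restartPolicy: Always        # restart the container after a crash
  containers:
    - name: app
      image: registry.example.com/critical-service:1.0   # hypothetical image
      # Wrap the entrypoint in a shell that exec's it, so the process
      # replaces the shell instead of running as its child.
      command: ["/bin/sh", "-c", "exec /entrypoint"]
```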
Another issue was network latency between our application nodes and our storage nodes. This wasn’t apparent during testing because our staging environment sat in a single subnet. Once Kubernetes workloads spanned multiple subnets, performance dropped noticeably. We had to adjust our network policies and provision more powerful storage nodes to keep up with demand.
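To see the cross-subnet penalty in numbers, even a crude probe of TCP connection setup time from an app node to a storage node tells the story. A minimal sketch in Python; the host and port are hypothetical (Ceph monitors listen on 6789 by default):

```python
import socket
import time

def connect_latency_ms(host, port, timeout=2.0):
    """Return the wall-clock time in ms to open (and close) a TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # the three-way handshake alone approximates network round-trip cost
    return (time.monotonic() - start) * 1000.0

# e.g. connect_latency_ms("10.2.0.10", 6789) from an application node
```

Comparing the same probe within a subnet and across subnets made it obvious where the time was going.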
The Joy of GitOps
One bright spot was integrating GitOps principles into our workflow. We set up a Git repository to manage all our Kubernetes configurations using kubectl apply -f. This allowed us to version control our cluster’s state and made rollbacks much easier. It also helped us keep track of changes more systematically.
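The sync loop itself was nothing fancy: pull the repo, then apply every manifest in it. A sketch of that step in Python; the repo layout is hypothetical, and it assumes kubectl is on PATH:

```python
import subprocess
from pathlib import Path

def find_manifests(repo_dir):
    """Return sorted paths of YAML manifests under the repo checkout."""
    return sorted(Path(repo_dir).rglob("*.yaml"))

def sync(repo_dir, dry_run=False):
    """Apply every manifest; with dry_run, just return the commands that would run."""
    cmds = [["kubectl", "apply", "-f", str(p)] for p in find_manifests(repo_dir)]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # fail fast on the first bad manifest
    return cmds
```

With the manifests versioned, a rollback was just checking out an earlier commit and running the same sync.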
However, we faced some challenges with tooling. At the time, Terraform was still at 0.x and not quite ready for prime time, so we leaned heavily on manual kubectl commands, which reintroduced exactly the variability that GitOps was supposed to mitigate.
The Future
Despite the challenges, I’m excited about where Kubernetes is taking us. It’s still maturing, but its potential is enormous. As more companies adopt it, we’ll see better tools and best practices emerge. For now, though, we’re just happy to be on this journey with a tool that promises so much.
The next few weeks will likely bring more bugs and issues as we continue to refine our setup, but I’m optimistic. Kubernetes is far from perfect, but it’s already making our lives easier in many ways. We’ve learned a lot from this migration, and the experience has made us better engineers.
That’s a day in the life with Kubernetes: a mix of frustration, relief, and excitement as we navigate the growing pains together.