
Debugging Kubernetes: A Real-World Encounter with Pod Disruption Budgets


September 29, 2016 was a Thursday. I woke up to emails from colleagues about the latest Kubernetes news. Kubernetes 1.4 had just shipped, and with it an alpha of the pod disruption budget (PDB) feature, which was supposed to be a game-changer for managing workloads in our growing cluster.

Background: Our Kubernetes Cluster

At this time, we were running a relatively small but rapidly expanding Kubernetes cluster. We had multiple teams deploying different applications, each with varying requirements. Some services needed high availability, while others could tolerate brief outages during deployments. Managing these competing requirements was proving to be a challenge.

The Problem: Unintended Pod Evictions

One of our applications, a critical data processing job, was experiencing intermittent failures due to pod evictions. The team initially thought it was an issue with the application itself, but as more and more pods were being interrupted during deployments, we realized that something else was at play.

We decided to dive deep into the logs and metrics. We found that even though our deployment strategy included rolling updates, some of the pods were still getting evicted prematurely, causing the entire job to fail. It was a frustrating cycle: we kept tuning our deployment strategy but couldn't seem to break it.
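
Our original manifests are long gone, so treat the names, image, and replica count in this sketch as illustrative, but the shape of our rolling-update strategy was roughly this:

```yaml
# Illustrative Deployment for the data processing service.
apiVersion: apps/v1            # this was extensions/v1beta1 back in 2016
kind: Deployment
metadata:
  name: data-processor
spec:
  replicas: 6
  selector:
    matchLabels:
      app: data-processor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1        # the Deployment replaces at most one pod at a time
      maxSurge: 1              # and may run one extra pod during the rollout
  template:
    metadata:
      labels:
        app: data-processor
    spec:
      containers:
        - name: processor
          image: example.com/data-processor:v1   # hypothetical image
```

The catch, which took us a while to appreciate, is that maxUnavailable only governs pods the Deployment itself replaces; it says nothing about evictions triggered by node drains or cluster scale-downs.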

Kubernetes Pod Disruption Budget (PDB)

Enter PDBs. According to the documentation, a PDB limits how many pods of a replicated application can be down at the same time due to voluntary disruptions: node drains, cluster scale-downs, and anything else that goes through the eviction API. Perfect for our scenario!

Implementing PDB

We started by defining our first PDB for the data processing application. We set minAvailable so that only a single pod could be voluntarily disrupted at a time, hoping this would stabilize our job's behavior. However, things didn't go as planned.
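
Reconstructed from memory, with the same illustrative labels as the Deployment sketch above, that first PDB looked something like this:

```yaml
# Illustrative PodDisruptionBudget; the selector and threshold are made
# up to match the earlier sketch.
apiVersion: policy/v1          # an alpha policy API at the time; policy/v1 today
kind: PodDisruptionBudget
metadata:
  name: data-processor-pdb
spec:
  minAvailable: 5              # with 6 replicas, at most one voluntary eviction at a time
  selector:
    matchLabels:
      app: data-processor
```

With minAvailable pinned that close to the replica count, nearly every voluntary disruption has to wait its turn.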

Upon deployment, we noticed a significant increase in the duration of our rolling updates. Old pods were no longer being torn down promptly once their replacements came online; they lingered until the budget permitted another disruption. This unexpected behavior left us scratching our heads.

Debugging the Behavior

We spent hours debugging this issue. The Kubernetes documentation didn't provide much guidance on diagnosing PDB-related problems, and we found ourselves diving into the source of the disruption controller inside kube-controller-manager, which is what computes how many disruptions a budget currently allows. It turned out that there were some nuances in the way the PDB was interacting with our rolling updates.
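
These days the quickest sanity check is to read the budget's status straight off the object. With the illustrative PDB above, kubectl get pdb data-processor-pdb -o yaml reports something like:

```yaml
# Abridged status block; disruptionsAllowed is the number the eviction
# API consults before letting a pod go.
status:
  currentHealthy: 6
  desiredHealthy: 5
  disruptionsAllowed: 1
  expectedPods: 6
```

If disruptionsAllowed sits at 0, evictions block until enough healthy pods come back, which is exactly the lingering behavior we were seeing.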

After much trial and error, we realized that the eviction timeout for our pods had been left at an unsuitable default. Tweaking that setting brought our deployments back to a predictable state while still honoring the PDB's constraints.
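
I won't pretend to remember the exact knob after all these years. If, as I suspect, it was the pod's termination grace period, the fix lives in the pod template of the Deployment sketch from earlier and looks like this (the value is hypothetical):

```yaml
# Sketch only: terminationGracePeriodSeconds bounds how long Kubernetes
# waits between SIGTERM and SIGKILL when a pod is evicted or replaced.
# The default is 30 seconds.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # hypothetical: give the job time to checkpoint
      containers:
        - name: processor
          image: example.com/data-processor:v1
```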

Lessons Learned

Debugging Kubernetes can be like trying to solve a puzzle with missing pieces. This experience taught us that:

  1. Documentation Only Gets You So Far: Even with good documentation, it's crucial to understand how components interact at a deeper level.
  2. Test Thoroughly: Don’t assume that what works in one environment will work in another. Test your configurations extensively before deploying them widely.
  3. Iterate and Refine: Kubernetes is evolving rapidly, so solutions that worked yesterday might not be the best fit today.

Moving Forward

With this experience under our belt, we became more confident in managing our Kubernetes deployments. We continued to experiment with different strategies and tools like Helm and Istio as they emerged. The tech landscape was changing quickly, but we were determined to stay ahead of the curve.

The 1.4 release had brought other new features that could help us further optimize our cluster. I took a deep breath, grabbed my laptop, and got ready for another day of tinkering with containers and code.


This blog post is a reflection on one of those frustrating but ultimately enlightening experiences in the early days of Kubernetes. The journey to understanding and effectively using tools like PDBs was a valuable lesson in persistence and learning from unexpected challenges.