$ cat post/memory-leak-found-/-i-diff-the-past-against-now-/-we-were-on-call-then.md

memory leak found / I diff the past against now / we were on call then


Title: Kubernetes Complexity: When Your Cluster Is the Problem


November 16, 2020. Another day in the life of a platform engineer. This month, the tech world buzzed with excitement over vaccines and a new direction for the US. Inside our company walls, we faced another challenge: managing a growing Kubernetes cluster that was becoming harder to handle.

Our team had been running services on Kubernetes for several years, but as more teams onboarded and dependencies multiplied, the complexity started to show. Pods were crashing, deployments weren’t rolling out cleanly, and troubleshooting became a nightmare. It felt like an arms race with our own infrastructure: just when things seemed stable, something else would break.

One day, I found myself staring at a Jenkins build that had been stuck for over two hours. The logs showed repeated connection timeouts to our Postgres database running inside Kubernetes. A quick look at the dashboard revealed that all pods in the database namespace were unhealthy and restarting non-stop. My immediate thought was, “Oh no, not another deployment.” But there was something different this time.

The problem wasn’t just about a single service failing; it was part of a larger pattern that I needed to understand before jumping into quick fixes. So, I set aside the Jenkins build and started digging through the Kubernetes cluster logs with kubectl. After hours of sifting through data, I noticed an oddity in the pod crashes: they weren’t always related to database connections.
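A minimal sketch of the kind of triage I mean, assuming the database runs in a namespace called `db` (the namespace and pod names here are placeholders, not our real ones):

```shell
# Pod health and restart counts in the database namespace
kubectl get pods -n db -o wide

# Describe a crashing pod to see recent events (OOMKilled, failed probes, etc.)
kubectl describe pod postgres-0 -n db

# Logs from the previous (crashed) container instance, not the current one
kubectl logs postgres-0 -n db --previous

# Cluster-wide events sorted oldest-to-newest, to spot patterns across namespaces
kubectl get events -A --sort-by=.metadata.creationTimestamp
```

The `--previous` flag is the one people forget: the logs you usually want are from the container that died, not the one that just replaced it.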

It turned out that several of our services shared a single Ingress controller. One of them was misbehaving, flooding another with requests through that shared controller, and the resulting timeouts cascaded into unrelated services. This wasn’t immediately obvious because our monitoring tools showed per-pod health but gave us no picture of the traffic flowing between pods.
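One mitigation for this class of problem, assuming the shared controller is ingress-nginx (the service and host names below are hypothetical), is a per-Ingress rate limit so one noisy client can’t starve everyone else behind the same controller:

```shell
# Hypothetical Ingress with ingress-nginx rate-limit annotations.
# limit-rps caps requests per second per client IP; limit-burst-multiplier
# allows short bursts above that rate before requests are rejected.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: orders-api        # hypothetical service
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "50"
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "3"
spec:
  rules:
  - host: orders.internal.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: orders-api
            port:
              number: 8080
EOF
```

Rate limiting treats the symptom, not the cause, but it turns a cascading failure into a contained one while you chase down the misbehaving caller.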

I realized that our Kubernetes cluster needed better visibility and more robust observability tooling. We had Prometheus collecting metrics but no dashboard that could correlate data across services. I made the case to my team for adopting Grafana as a visualization layer on top of Prometheus, so we could actually see how the different parts of the application interacted.
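The wiring itself is mostly declarative. A sketch of a Grafana datasource provisioning file, assuming Prometheus and Loki run as in-cluster services under these hypothetical DNS names:

```shell
# Hypothetical Grafana datasource provisioning file, placed under
# /etc/grafana/provisioning/datasources/ so Grafana picks it up at startup.
cat <<'EOF' > datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090   # assumed service address
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc:3100         # assumed service address
EOF
```

Provisioning datasources from files rather than clicking through the UI means the observability stack itself is reproducible, which matters when you rebuild clusters.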

The conversation was tough because our team was already dealing with limited resources and expanding responsibilities due to remote work requirements brought on by the pandemic. But we knew that without proper monitoring, we couldn’t address issues like this one efficiently.

After a heated discussion, we decided to go ahead with Grafana. Implementing it wasn’t easy; it took a significant amount of configuration and tuning. We spent weeks setting up alerts and dashboards and wiring Grafana to both Prometheus for metrics and Loki for logs. It was hard work, but it paid off the first time we could correlate logs from different pods and pin down the root cause of a failure.
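On the alerting side, a Prometheus rule along these lines (a sketch; it assumes kube-state-metrics is installed, since that is what exposes the restart counter) would have paged us on the restart storm described earlier instead of leaving it to a stuck Jenkins build:

```shell
# Hypothetical Prometheus alerting rule: fire when any container has
# restarted more than 5 times in the past hour (requires kube-state-metrics).
cat <<'EOF' > crashloop-alert.yaml
groups:
  - name: pod-health
    rules:
      - alert: PodRestartStorm
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
EOF
```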

Looking back at that period, I can see how Kubernetes complexity fatigue had set in. The technology itself is powerful, but running a cluster at scale takes thoughtful planning and continuous improvement. Kubernetes makes deployment easier, yet it brings its own set of challenges, and proper observability tooling is essential for keeping the infrastructure healthy.

This experience taught me the importance of proactive management over reactive troubleshooting. It’s about building resilience into your systems so you can handle unexpected issues more gracefully. The tech world changes rapidly, but some things stay constant, like the need to keep improving and adapting our tools and practices to meet new challenges.


That was a day that shaped how I approached platform engineering for years to come. The lessons learned from dealing with Kubernetes complexity still resonate today as we continue to scale our infrastructure in response to growing demands.