$ cat post/netstat-minus-tulpn-/-a-port-scan-echoes-back-now-/-i-saved-the-core-dump.md

04DEC17

netstat minus tulpn / a port scan echoes back now / I saved the core dump

Title: Notes from a Snowy Night: A Kubernetes Saga

December 4th, 2017. The snow was starting to fall outside my window, like little diamonds on the roof of the world. Inside, I was in the thick of another Kubernetes debugging marathon that felt more like the Matrix—layer upon layer of abstraction, all wrapped up in a cold, dark night.

A Night in the Pod

Last week, we rolled out a new service using Helm charts to our production Kubernetes cluster on Google Cloud Platform (GCP). Things looked smooth at first: the services deployed without any hitches, and everything was green in our Prometheus dashboard. But then, around 2 AM, alarms started going off like firecrackers.

We noticed that one of our critical services was starting to fail. The pods were crashing and being restarted rapidly—too rapidly for us to keep up with debugging. The logs showed a cryptic error: “Failed to open socket.” It looked like some kind of network issue between the service and its database, but nothing in the networking setup seemed off.
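When pods restart faster than anyone can read their logs, it can help to rank them by restart count first. The following is a minimal sketch, assuming the output of `kubectl get pods -o json` has been captured as a string. The pod names, the sample data, and the `min_restarts` threshold are hypothetical and not taken from our cluster.

```python
import json

# Hypothetical sample in the shape of `kubectl get pods -o json` output.
SAMPLE_POD_LIST = json.dumps({
    "items": [
        {"metadata": {"name": "orders-api-7d9f"},
         "status": {"containerStatuses": [{"name": "app", "restartCount": 12}]}},
        {"metadata": {"name": "orders-worker-5c2b"},
         "status": {"containerStatuses": [{"name": "app", "restartCount": 7}]}},
        {"metadata": {"name": "frontend-1a2b"},
         "status": {"containerStatuses": [{"name": "web", "restartCount": 0}]}},
    ]
})


def crash_looping_pods(pod_list_json, min_restarts=5):
    """Return (pod, container, restarts) tuples for containers restarted
    at least `min_restarts` times, most-restarted first."""
    pods = json.loads(pod_list_json)
    flagged = []
    for pod in pods.get("items", []):
        pod_name = pod["metadata"]["name"]
        # Pods that have not started any container yet have no containerStatuses.
        for status in pod.get("status", {}).get("containerStatuses", []):
            if status.get("restartCount", 0) >= min_restarts:
                flagged.append((pod_name, status["name"], status["restartCount"]))
    return sorted(flagged, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for pod_name, container, restarts in crash_looping_pods(SAMPLE_POD_LIST):
        print(f"{pod_name}/{container}: {restarts} restarts")
```

The pods at the top of that list are the ones worth running `kubectl logs --previous` against, since that flag shows the output of the container instance that just crashed.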

A Trip Down Memory Lane

I had worked on this cluster before, when it was just a couple of nodes. Now we were running dozens of services with complex interdependencies. I pulled up my notes from the initial setup: those endless hours spent configuring GKE, deploying Helm, and setting up Istio as the service mesh.

I decided to start debugging by checking the network policies. These policies had been one of the first things we set up to ensure secure communication between pods. But with so many services and so much traffic, it was hard to keep track of everything. I needed a way to visualize what was happening.

Enter: Calico

Calico is an open-source networking solution that has gained traction as Kubernetes becomes more mainstream. I remembered reading about it when it first emerged but never had the chance to use it in production. Now, with this cluster, it seemed like a good time to dive in and see if it could help us visualize and troubleshoot our network issues.

After an hour of setting up Calico for debugging purposes (which is still not as straightforward as I would like), I was able to see the traffic patterns between pods and services. The “Failed to open socket” errors made more sense now: there were dropped packets due to some overly restrictive policies we had applied inadvertently.
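To illustrate how an overly restrictive policy ends up dropping traffic, here is a simplified sketch of the ingress decision for Kubernetes NetworkPolicy objects. It is not Calico's implementation and not our actual policy set. It assumes all pods live in one namespace and only looks at `podSelector.matchLabels`, ignoring ports, `namespaceSelector`, `ipBlock`, and `matchExpressions`. The policy and labels (`db-allow-billing`, `app: orders`, and so on) are hypothetical.

```python
def selects(selector, labels):
    """An empty selector matches every pod; otherwise every matchLabels
    pair must be present on the pod."""
    wanted = selector.get("matchLabels", {})
    return all(labels.get(key) == value for key, value in wanted.items())


def ingress_allowed(policies, src_labels, dst_labels):
    """Simplified check: may a pod with src_labels connect to a pod with dst_labels?"""
    # Only policies that cover ingress and select the destination pod matter.
    applicable = [
        p for p in policies
        if "Ingress" in p["spec"].get("policyTypes", ["Ingress"])
        and selects(p["spec"].get("podSelector", {}), dst_labels)
    ]
    if not applicable:
        return True  # no policy selects the pod, so it is not isolated
    for policy in applicable:
        for rule in policy["spec"].get("ingress", []):
            peers = rule.get("from")
            if not peers:
                return True  # a rule without "from" allows every source
            for peer in peers:
                # Assumption: only podSelector peers are modelled here.
                if "podSelector" in peer and selects(peer["podSelector"], src_labels):
                    return True
    return False  # selected by a policy, but no rule admits this source


# Hypothetical policy: only pods labelled app=billing may reach the database.
POLICIES = [{
    "metadata": {"name": "db-allow-billing"},
    "spec": {
        "podSelector": {"matchLabels": {"app": "db"}},
        "ingress": [{"from": [{"podSelector": {"matchLabels": {"app": "billing"}}}]}],
    },
}]

if __name__ == "__main__":
    print(ingress_allowed(POLICIES, {"app": "billing"}, {"app": "db"}))
    print(ingress_allowed(POLICIES, {"app": "orders"}, {"app": "db"}))
```

In this sketch the `orders` pod is refused because the database is selected by a policy whose only rule names `billing`. A new service that was never added to an existing allow list fails in the same way, and from the client's side it just looks like a socket that will not open.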

A Lesson in Kubernetes

This experience reinforced a few things for me:

Documentation is Key: We had a lot of undocumented changes and decisions that got buried over time. This is why I’ve been advocating for better documentation practices with our team.
Complexity Awaits: Kubernetes, as powerful as it is, comes with complexity. Tools like Calico can help manage some of the intricacies but don’t replace careful planning and understanding.
Iterative Debugging: In situations where things go awry, take a step back, reassess, and use the right tools to debug iteratively.

Conclusion

By 4 AM, we had identified the issue and refined our network policies, and the service was up and running smoothly again. The snow outside provided an odd contrast to the urgency inside as I wrapped up for the night. It’s nights like these that remind me why I love working with Kubernetes: it’s a challenge, but it also offers endless opportunities to learn and grow.

Sometimes, life just piles on in ways you don’t expect. But when you’re dealing with a system as complex as Kubernetes, you need those snowflakes to fall just right, because they might be the key that unlocks the next layer of complexity.