$ cat post/net-split-in-the-night-i-traced-it-to-the-library-no-rollback-existed.md
net split in the night / I traced it to the library / no rollback existed
Title: Kubernetes and the Great Container Convergence
October 8, 2018. I can barely believe it’s been a year since Docker was all anyone could talk about. Back then, everyone was fighting over containerization tools, but now… well, let’s just say there’s been some consolidation.
I’ve been spending most of my days working with Kubernetes and its ecosystem. It feels like every week brings a new piece to the puzzle: Helm for charts, Istio for service mesh, Envoy as the sidecar proxy. And don’t even get me started on serverless—Lambda and Fargate are gaining steam, but I still prefer rolling my own with a Kubernetes deployment.
I spent the last few weeks wrestling with an issue that was driving me crazy: intermittent networking problems in our staging environment. Pods would randomly lose their network connections to external services, causing timeouts and retries. It’s one of those problems that feels like it should be simple but turns out to be a rabbit hole of network policies, service accounts, and firewall rules.
I tried everything under the sun: updated my manifests, checked DNS, poked around with kubectl exec into pods, but nothing resolved it fully. The closest I got was setting hostNetwork: true in the pod specs, which helped a bit but wasn't ideal: it puts the pod directly on the node's network and sidesteps pod-level isolation, which raises real security concerns.
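For reference, the workaround looked roughly like this. It's a sketch, not our actual manifest; the names and image are placeholders:

```yaml
# Hypothetical deployment snippet showing the hostNetwork workaround.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-api            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: staging-api
  template:
    metadata:
      labels:
        app: staging-api
    spec:
      hostNetwork: true        # pod shares the node's network namespace,
                               # bypassing the pod network (and its policies)
      dnsPolicy: ClusterFirstWithHostNet  # keeps cluster DNS working with hostNetwork
      containers:
        - name: api
          image: example/staging-api:latest  # placeholder image
```

The dnsPolicy line matters: with hostNetwork on, the default DNS policy stops resolving cluster service names.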
Then, just as I was about to give up and write the staging network off as a living hell, I got a break from running netstat inside one of our pods. Connections were timing out after excessive retransmissions: TCP packets were being lost or delayed badly enough to blow past the timeouts. That was an eye-opener; I hadn't considered that our network policies might be affecting TCP behavior at all.
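If you want to check the same thing, something like this works (the pod name is a placeholder, and netstat has to exist in the container image):

```sh
# Retransmission counters from inside the pod; run it twice and see if they climb.
kubectl exec staging-api-6f7d9-abcde -- netstat -s | grep -i retrans

# If the image doesn't ship netstat, /proc has the same data:
# the Tcp lines include a RetransSegs counter.
kubectl exec staging-api-6f7d9-abcde -- cat /proc/net/snmp | grep -w Tcp
```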
I dove back into the docs and found the explanation: once a NetworkPolicy with an egress section selects a pod, any outbound traffic it doesn't explicitly allow gets dropped. Adding allow-egress rules for the specific external services solved our issue. Some of those dependencies also had tight timeout settings, so even intermittent drops and the retransmissions they triggered were enough to fail connections outright. With traffic explicitly allowed through the policies, the staging environment stabilized.
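Here's a minimal sketch of the kind of policy that fixed it. The namespace, labels, and CIDR are stand-ins, not our real values:

```yaml
# Hypothetical policy: allow egress from the staging pods to one external
# dependency. Once a policy with policyTypes: Egress selects a pod, any
# egress it doesn't explicitly allow is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-external-api
  namespace: staging
spec:
  podSelector:
    matchLabels:
      app: staging-api           # placeholder label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24 # stand-in for the external service's range
      ports:
        - protocol: TCP
          port: 443
    - to:                        # don't forget DNS, or lookups will fail too
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
```

The DNS rule is the part people forget: an egress policy that only allows the external service will silently break name resolution for the same pods.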
This experience highlighted the importance of understanding all the layers involved in your deployment: from the network policy to the pod itself. Kubernetes is powerful, but it can also be a bit of a beast if you’re not careful. I think this episode taught me that while container orchestration tools like Kubernetes are great for abstracting away much of the complexity, they still rely on well-structured and carefully thought-out networking policies.
As we move forward with more complex deployments using Helm and service meshes, I’m going to have to pay even closer attention to these details. Maybe next time, I’ll get it right from the start!
On reflection, this blog post is a snapshot of my experience in 2018, grappling with Kubernetes networking issues. The tech landscape was changing fast, with new tools and concepts emerging all the time. It's always worth stepping back and writing down what you've learned, even if it's just for yourself.