
packet loss at dawn / the database was the truth / the pipeline knows


Debugging Kubernetes with eBPF: A Day in the Life


January 27, 2020. Another day, another problem with our k8s cluster. This time it’s a pesky network issue that just won’t go away. I’ve been fighting this particular beast for weeks now, and it’s starting to feel like we’re in an endless loop of debugging. But today, something clicks.


The Problem: Network Latency Woes

We had some strange latency issues with one of our microservices. It was intermittently slow, but only on certain requests. I spent days poring over logs and metrics, trying to find the culprit. No luck.

Then, a friend mentioned eBPF. I knew it was gaining attention in the DevOps community for its ability to do deep packet inspection and kernel tracing, but I hadn’t actually used it before. Time to dive in.


Enter eBPF

I installed bpftrace (and bpftool for poking at loaded programs) on one of our nodes and started experimenting. The first few commands were a bit like hitting my head against the wall—I couldn't figure out how to get any meaningful output. But after a few tries, I managed to attach a simple tracepoint probe that logged network packets leaving my service.
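Half the battle was just finding valid probe names. bpftrace can enumerate them for you; this is a sketch of the discovery step I ended up using (it assumes bpftrace is installed on the node and you're running as root):

```shell
# List every networking tracepoint the kernel exposes
sudo bpftrace -l 'tracepoint:net:*'

# Show the argument struct for one of them, so you know what args-> fields exist
sudo bpftrace -lv 'tracepoint:net:net_dev_xmit'
```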

sudo bpftrace -e 'tracepoint:net:net_dev_xmit { printf("sent %d bytes comm=%s\n", args->len, comm); }'

I ran this on the node where the suspect pod was scheduled. Then, using kubectl, I fired a request at my service and voilà! The command printed a line for each packet as it left the container.
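Triggering the request was just a matter of curling the service from inside the cluster while the probe was running. Something along these lines, where the pod and service names are invented for illustration and the client pod is assumed to have curl available:

```shell
# Hit the suspect service from a throwaway client pod and print total request time
kubectl exec -it debug-pod -- \
  curl -s -o /dev/null -w '%{time_total}\n' \
  http://checkout.default.svc.cluster.local/healthz
```

Watching `time_total` alongside the trace output made it easy to line up the slow requests with what the kernel was doing at that moment.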


Uncovering the Culprit

With eBPF, I could now see exactly what was happening at a very low level. It turned out the latency wasn't caused by our application at all: another pod on the same host was saturating the network, and our service's traffic was being throttled as a consequence of the resource limits in effect on that node.

This wasn’t immediately obvious from any of our metrics or logs because we only had visibility at the container level. eBPF allowed me to peek directly into the kernel and observe the packets as they were being processed.
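The raw packet stream was too noisy to eyeball, so what actually exposed the neighbor was tallying traffic per process name. Here's a sketch of that post-processing step, run against a fabricated sample of trace lines (the format and process names are assumed; in reality the input came from redirecting the bpftrace output to a file):

```shell
# trace.log would normally come from: sudo bpftrace -e '...' > trace.log
# Fabricated sample lines in the assumed "comm=<process>" format:
printf 'sent 1500 bytes comm=batch-worker\nsent 512 bytes comm=checkout\nsent 1500 bytes comm=batch-worker\n' > trace.log

# Count packets per process; the top talker points at the noisy neighbor
awk -F'comm=' '{ split($2, f, " "); n[f[1]]++ } END { for (p in n) print n[p], p }' trace.log | sort -rn
```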


The Solution

With this insight, I was able to adjust the resource allocation for the pod sharing the host with my service. It wasn’t a perfect solution—sometimes you have to balance performance across multiple services—but it did significantly reduce the latency issues.
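The adjustment itself was unglamorous: tightening the limits on the noisy neighbor's deployment. A sketch of the kind of change involved, with the deployment name and numbers made up for illustration:

```shell
# Hypothetical: cap the neighbor's resources so it stops starving the node
kubectl -n default set resources deployment/batch-worker \
  --limits=cpu=500m,memory=512Mi \
  --requests=cpu=250m,memory=256Mi
```

In practice you'd pick the numbers from observed usage rather than guessing, and watch the latency metrics afterward to confirm the change actually helped.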

The real win here is that eBPF provided us with an incredibly powerful tool to debug issues we couldn’t solve before. It’s like having a superpower for DevOps work, but with all the responsibility of managing something so close to the metal.


Reflecting on the Day

This day was both frustrating and exciting. Frustrating because it took weeks of dead ends before anything gave. Exciting because eBPF opened up new possibilities for debugging and monitoring our systems at a level we hadn't had before.

As the day wrapped up, I found myself wondering how many other issues like this were out there that we just couldn’t see clearly enough. With tools like eBPF, maybe we can start to peel back the layers of complexity in Kubernetes and get to the root causes more quickly.


Parting Thoughts

It’s times like these that make me appreciate the community behind open-source technologies. Without sharing knowledge, experimenting with new tools, and helping each other, we wouldn’t be making progress as fast as we are today.

So here’s to eBPF—may it continue to help us solve the hard problems in Kubernetes and beyond.


Happy debugging!