$ cat post/a-merge-conflict-stays-/-i-diff-the-past-against-now-/-i-saved-the-core-dump.md
a merge conflict stays / I diff the past against now / I saved the core dump
Title: Kubernetes Complexity Fatigue Hits Home
April 12, 2021. I woke up to another day of debugging and optimization work on our ever-growing Kubernetes cluster. It feels like we’re in the middle of a never-ending saga of microservices and sidecars, but today was particularly rough.
Last night, my team reported a strange issue: one of our critical services had started throwing 502 errors intermittently. This service handles payments for our users, so it’s not something we can afford to have down for long. I knew I needed to get to the bottom of this fast.
After pulling up the logs and pod metrics, I noticed unusual spikes in memory usage on a few pods. That’s when my heart sank: I’d been hearing about eBPF and how it could help with low-level debugging, but we hadn’t adopted it yet. It was time to put that knowledge into practice.
I started by inspecting the cgroup stats for the affected pods to get a picture of what was going on, but between those numbers and the logs there were just too many moving parts and not enough context. That’s when I remembered a colleague mentioning eBPF over lunch recently.
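For the record, that first spot-check looked something like this (the namespace, pod names, and cgroup paths are illustrative, and the path assumes the cgroup v1 layout our nodes run):

```
# Per-pod memory as reported by the metrics server
kubectl top pods -n payments --sort-by=memory

# On the node itself: raw memory usage for a suspect container
# (cgroup v1 path; the pod UID and container ID are made up)
cat /sys/fs/cgroup/memory/kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>/memory.usage_in_bytes
```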
I quickly set aside some time to brush up on the basics of eBPF. Armed with that, I used bpftool to see what was already loaded on the nodes and bpftrace to trace system calls in real time. After a few attempts at crafting the right script, I got it working. The output was overwhelming but incredibly helpful: a large number of write() calls were failing, which pointed at something going wrong in our storage layer.
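The script that cracked it open was nothing fancy; something in the spirit of this one-liner (a sketch from memory, not the exact script I ran):

```
# Count write(2) calls that return an error, keyed by process and errno
sudo bpftrace -e 'tracepoint:syscalls:sys_exit_write /args->ret < 0/ { @errs[comm, args->ret] = count(); }'
```

Hit Ctrl-C and bpftrace dumps the @errs counts, so you can see at a glance which processes are failing and with what errno.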
With more data in hand, I went back to my Kubernetes dashboard and started tracing the network requests from one of the problematic pods. I noticed a strange pattern: some requests were timing out while others completed without issues. This pointed me towards the ingress controller, which seemed to be misrouting traffic based on certain headers.
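The quickest way to see the header-dependent behavior was to hit the same endpoint twice, varying only the header (the hostname and header name below are stand-ins):

```
# Same endpoint, two requests; only the header differs
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' https://payments.internal/healthz
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' -H 'X-Session-Affinity: sticky' https://payments.internal/healthz
```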
After a few hours of digging, the apparent misrouting turned out to be a misconfigured service mesh sidecar that was caching responses improperly and serving stale results for some requests. Once I fixed the configuration and restarted the pod, everything started working smoothly again. The 502 errors stopped coming in, and our payment processing returned to normal.
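For completeness, “restarted the pod” meant a rolling restart of the deployment, roughly like so (the namespace and deployment name are illustrative):

```
kubectl -n payments rollout restart deployment/payment-gateway
```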
This experience highlighted two things for me: first, how much eBPF can enhance troubleshooting at a low level, and second, that even with modern tools like Kubernetes, there’s still a lot of complexity lurking beneath the surface. As we move forward, I think it’s crucial for us to stay on top of these new technologies while also keeping in mind the importance of good old-fashioned debugging skills.
It was an exhausting but rewarding day. The tech landscape is moving so fast that it’s easy to get caught up in shiny new tools without taking a step back and assessing their true value. For now, I’ll keep honing my skills, both with Kubernetes and eBPF, ensuring we can handle the challenges of our growing infrastructure while keeping things simple when possible.
Until next time,
Brandon