$ cat post/grep-through-the-dark-log-/-a-crontab-from-two-thousand-two-/-i-saved-the-core-dump.md
grep through the dark log / a crontab from two thousand two / I saved the core dump
Title: Debugging the Kubernetes Cluster with eBPF: A Day in the Life
December 13th, 2021 was another ordinary day. The world was buzzing with chatter about platform engineering and internal developer portals like Backstage, and folks were still grappling with the complexity of running Kubernetes clusters at scale. I found myself deep into a debugging session on one of our production clusters when an article about stealth bombers showing up on Google Maps caught my eye amidst the usual Hacker News flood.
The Incident
It all started when we received an alert from Prometheus about excessive CPU usage spikes on a couple of pods in a key service. We’ve been running Kubernetes for years, and while it’s generally stable, these kinds of spikes are always a red flag. Our stack typically includes a mix of custom microservices and some third-party components, so the usual suspects like overloaded processes or bad code paths were always on my radar.
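For context, a Prometheus alerting rule for that kind of CPU spike might look roughly like this (a sketch: the group name, threshold, namespace label, and `for` duration are all hypothetical, not our actual config):

```yaml
groups:
  - name: cpu-alerts
    rules:
      - alert: PodCpuSpike
        # Per-pod CPU usage over the last 5 minutes, as a fraction of one core.
        # container_cpu_usage_seconds_total comes from cAdvisor via the kubelet.
        expr: sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])) by (pod) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} has sustained high CPU usage"
```

The `for: 10m` clause is what separates a genuine sustained spike from a transient blip.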
The Tools at Hand
We had our trusty kubectl command-line tool, but it often feels too high-level for deep-dive debugging. That's where eBPF comes in—extended Berkeley Packet Filter. This kernel technology lets us attach small, sandboxed programs to kernel events (tracepoints, kprobes, and so on) without modifying kernel source or loading modules, making it perfect for tracing and profiling.
Digging In
I fired up bpftrace. The first step was to identify which processes were causing the spike; a one-liner counting execve calls per process name gave me a list of the offenders:
# bpftrace -e 'tracepoint:syscalls:sys_enter_execve { @execs[comm] = count(); }'
This suggested that something in one of the pods was repeatedly spawning processes—a shell script, perhaps. But I needed more granular data.
eBPF to the Rescue
To get even deeper into what was happening, I used bpftrace to print exactly what those processes were executing:
# bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("pid=%d comm=%s argv=", pid, comm); join(args->argv); }'
pid=12345 comm=my-critical-service argv=/bin/bash -c sleep 100
This revealed that a misconfigured cron job in one of our services was spawning bash processes over and over—each one just ran sleep and exited, but the constant fork/exec churn explained the CPU spike.
A Lesson Learned
Debugging with eBPF can be incredibly powerful, but it’s also easy to get lost in its complexity if you’re not familiar with it. I spent quite some time figuring out how to write the correct bpftrace queries and interpreting the output.
For future reference, here are a few tips for anyone looking into eBPF:
- Start simple: Begin with basic tracepoints like system calls.
- Read the documentation: the bpftrace reference guide and the kernel's tracepoint docs are your friends.
- Test in a safe environment first: Don’t inject code directly into production without testing it thoroughly.
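To make the first tip concrete, here is about the simplest useful bpftrace program I know—a sketch, assuming bpftrace is installed and you're running as root, since attaching BPF programs requires privileges:

```shell
# Count every syscall per process name; Ctrl-C prints the @syscalls map.
# raw_syscalls:sys_enter fires for *all* syscalls, so expect some overhead.
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @syscalls[comm] = count(); }'

# Narrow it down once you have a suspect: trace only one PID.
# (12345 here is a placeholder for whatever PID you're investigating.)
bpftrace -e 'tracepoint:raw_syscalls:sys_enter /pid == 12345/ { @[probe] = count(); }'
```

Starting with a broad count and then adding a predicate filter is the natural progression: cheap overview first, targeted detail second.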
A Brief Detour
As I was wrapping up, the news about the Log4j RCE started rolling in. I couldn't help but chuckle—a good reminder that while Kubernetes can abstract away many complexities, we still need to keep an eye on our dependencies and libraries.
Conclusion
By the time I finished fixing the cron job and ensuring it wouldn’t happen again, the day was already well into the evening. Reflecting back on my day, I realized how much I rely on eBPF for deep-dive debugging in Kubernetes clusters. It’s a tool that, while complex, can save hours of time by providing insights directly from the kernel.
As I closed up my laptop and headed home, another day of ops work under my belt, I couldn’t help but wonder what other mysteries would unfold with the tools at our disposal. The tech world is always changing, and staying ahead requires constant learning and adaptation. But for now, I’m just glad to have found a solution that kept things running smoothly.
The end.