$ cat post/first-commit-pushed-live-/-the-abstraction-leaked-everywhere-/-the-signal-was-nine.md

first commit pushed live / the abstraction leaked everywhere / the signal was nine


September’s Debugging Frenzy: Kubernetes Cluster Chaos

Today marks a significant shift in how we handle infrastructure at our company. It’s been an eventful month, especially with the push towards more platform engineering and SRE practices. I thought I’d jot down some of the technical challenges and learnings from the last few weeks.


It was September 21st, 2020, and just as the sun was setting over our bustling office building, my pager started beeping like crazy. Our Kubernetes cluster was in a state of chaos. Services were going down left and right, causing a bit of a ruckus on our internal developer portal (Backstage). The team was already spread thin with remote work due to the ongoing pandemic, so every outage was extra stressful.

The root cause? A misconfigured eBPF program that was introduced in one of our newer clusters. This little snippet had been quietly running for days before it finally bit us hard. What we thought was a simple performance optimization ended up causing massive CPU spikes and resource exhaustion. The funny part is, I’d written the original script months ago while trying to debug an issue with network traffic. Back then, I was just looking for quick wins without fully considering the long-term impact.

The Debugging Journey

I dove into the logs, tracing back the symptoms. It wasn’t hard to spot that something was wrong; our metrics were off the charts. The eBPF program was supposed to measure packet sizes and drop any packet that exceeded a certain threshold. But as it turned out, we had accidentally rolled it out to every pod without any limits on where it could run. That meant the filter was inspecting every packet on every pod, and the per-packet overhead snowballed into a domino effect of CPU spikes and resource exhaustion.
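
For the curious, the core idea was conceptually something like the XDP sketch below. This is not the actual program we ran, just a minimal, hypothetical illustration of the drop-over-a-threshold logic, with made-up names and a made-up limit:

```c
// Hypothetical sketch only -- not the real filter. An XDP program that
// drops any packet larger than a fixed threshold and passes the rest.
// Build with: clang -O2 -g -target bpf -c drop_oversized.c -o drop_oversized.o
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_PACKET_BYTES 1500   /* made-up threshold, purely illustrative */

SEC("xdp")
int drop_oversized(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* If the packet extends more than MAX_PACKET_BYTES past its start,
     * it is over the threshold and gets dropped. */
    if (data + MAX_PACKET_BYTES < data_end)
        return XDP_DROP;

    return XDP_PASS;    /* everything else passes untouched */
}

char LICENSE[] SEC("license") = "GPL";
```

Harmless looking, right? That was exactly the trap: even a tiny per-packet filter has a cost, and once it runs against every packet on every pod, that cost multiplies across the cluster.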

I spent the better part of an evening manually disabling the program on each node and pod. It was tedious work, but necessary. As I toggled switches and watched the metrics stabilize, I couldn’t help but think about how easy it is to get caught up in quick fixes without proper testing.
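
In hindsight, I should have scripted the cleanup instead of hand-toggling things node by node. Assuming the filter was attached as an XDP program and that libbpf 0.8 or newer is available, a tiny helper along these lines, run against the right interface on each node, could handle the detaching; treat it as a sketch rather than the tool we actually used:

```c
// Sketch only: detach whatever XDP program is attached to a given
// interface. Assumes libbpf >= 0.8 (for bpf_xdp_detach) and that the
// misbehaving filter was attached via XDP in the first place.
// Build with: cc detach_xdp.c -o detach_xdp -lbpf
#include <stdio.h>
#include <net/if.h>
#include <bpf/libbpf.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <ifname>\n", argv[0]);
        return 1;
    }

    /* Resolve the interface name (e.g. eth0) to its index. */
    unsigned int ifindex = if_nametoindex(argv[1]);
    if (ifindex == 0) {
        perror("if_nametoindex");
        return 1;
    }

    /* Detach any XDP program on the interface; flags 0 and NULL opts
     * mean "default attach mode, no extra options". */
    if (bpf_xdp_detach(ifindex, 0, NULL) != 0) {
        fprintf(stderr, "failed to detach XDP program from %s\n", argv[1]);
        return 1;
    }

    printf("detached XDP program from %s\n", argv[1]);
    return 0;
}
```

Wrapped in a quick loop over the node list, something like that would probably have saved me most of the evening.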

Lessons Learned

This incident underscored a few key points for us:

  1. Proper Testing: We need more rigorous testing of new tools before they go live. Especially with eBPF and other powerful technologies, it’s crucial to understand the full implications.
  2. Documentation: While I had written comments in the code explaining its purpose, we needed better documentation on how to handle and disable such programs. This could help prevent similar issues in the future.
  3. Team Collaboration: SRE practices are important for cross-functional teams like ours. More frequent collaboration between engineering and platform teams can lead to better decisions and quicker resolution times.

Moving Forward

As I write this, we’re starting to incorporate these lessons into our development process. We’ve added more thorough testing phases and improved documentation around eBPF usage. Our team is also exploring ways to integrate tools like Backstage more seamlessly into our daily operations, making it easier for everyone to contribute to platform engineering.

The chaos of September 21st has left a lasting impression on me—especially the realization that even with all the best intentions, quick fixes can backfire. But in the end, these experiences shape us and make us better engineers and team members.


It’s been a rough few weeks, but there’s always something to learn from each setback. I’ll keep pushing forward, hoping to avoid similar headaches in the future. After all, every challenge is an opportunity to grow.

Until next time, Brandon