$ cat post/january-2021:-a-month-of-infra-messes-and-microservices-mayhem.md

January 2021: A Month of Infra Messes and Microservices Mayhem


It’s been a while since I last posted on my personal blog, but the end of last year left me with plenty to think about. There’s a lot to catch up on, especially in platform engineering and infrastructure. Here’s where we stood in January 2021:

The Infra Mess

We were deep into Kubernetes land, and it came with its own set of messes. Complexity fatigue was setting in: it felt like every day brought a new PITA (Problem in Troubleshooting Area) to fix. One particularly gnarly issue involved a service that kept crashing at random intervals. After hours of digging through logs and metrics, I traced it to a simple but frustrating eBPF issue.

The eBPF Incident

eBPF had really taken off as a way to extend kernel functionality without touching the kernel code itself. We had a service that used an eBPF program for some clever network monitoring, and something in it was causing the crashes. After some back-and-forth with the Linux community (and a few cups of coffee), we finally identified the issue: our eBPF program was trying to manipulate data it shouldn’t have been accessing.

The solution? A bit of refactoring and a lot of klog analysis. It’s these small, tedious fixes that keep us on our toes. The lesson here is simple: always double-check what your eBPF programs are doing. Kubernetes complexity can lead to all sorts of fun edge cases!
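
To make that lesson a little more concrete, here is a minimal Python sketch of the check-before-you-read pattern that packet-inspecting code should follow. It is not the program from the incident (eBPF programs are normally written in restricted C, and I’m not reproducing the actual bug here). The function name `read_ipv4_protocol` and the packet layout are assumptions chosen purely for illustration.

```python
import struct
from typing import Optional

ETH_HEADER_LEN = 14          # destination MAC + source MAC + EtherType
ETHERTYPE_IPV4 = 0x0800
IPV4_MIN_HEADER_LEN = 20     # IPv4 header without options


def read_ipv4_protocol(packet: bytes) -> Optional[int]:
    """Return the IPv4 protocol number, or None if the packet is too short or not IPv4.

    Hypothetical helper: it models the habit of validating bounds before
    every read instead of trusting that the data is there.
    """
    # Make sure the Ethernet header is fully present before reading it.
    if len(packet) < ETH_HEADER_LEN:
        return None
    (ethertype,) = struct.unpack_from("!H", packet, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None
    # Check again before touching the IPv4 header.
    if len(packet) < ETH_HEADER_LEN + IPV4_MIN_HEADER_LEN:
        return None
    # The protocol field is byte 9 of the IPv4 header.
    return packet[ETH_HEADER_LEN + 9]
```

A truncated or non-IPv4 packet returns `None` rather than reading past the data that is actually there.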

Remote First at Warp Speed

With remote work at full throttle, we found ourselves ramping up our infrastructure to support a more distributed workforce. This meant scaling our internal developer portals (Backstage) and SRE roles across multiple locations. The challenge was ensuring that everyone had access to the tools they needed without sacrificing security or performance.

Scaling Backstage

One particular challenge with Backstage is keeping it performant while adding new features. We’ve been using a combination of caching, service mesh strategies, and custom alerts to keep things running smoothly. It’s a lot like walking a tightrope: every tweak counts!
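
As a rough illustration of the caching side, here is a small time-to-live cache sketched in Python. It is a generic pattern, not Backstage’s own cache API (Backstage itself is TypeScript), and the `TTLCache` class and its `get_or_load` method are hypothetical names made up for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache values for a fixed number of seconds, then reload them on demand."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve the cached value only while its expiry time is still in the future.
        if entry is not None and entry[0] > now:
            return entry[1]
        # Missing or expired: call the (possibly slow) loader and remember the result.
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value
```

The idea is that an expensive lookup, such as a catalog query, runs at most once per TTL window per key, trading a little staleness for fewer backend calls.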

GitOps and SRE

Argo CD and Flux were maturing nicely, but there was still that pesky “complexity” label hanging over them. We argued long into the night about whether we should adopt a more declarative approach or stick with our imperative scripts. The pro-declarative side won in the end, mostly because declarative configuration is easier to manage and audit.
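
For anyone who missed the debate, the core of the declarative idea fits in a few lines of Python: describe the state you want, compare it with what is running, and derive the actions from the difference. This is a toy sketch, not how Argo CD or Flux are implemented, and `plan_changes` and its version-string maps are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_changes(desired: Dict[str, str], actual: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare desired and actual state (name -> version) and list the actions needed."""
    changes: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            changes.append(("create", name))
        elif actual[name] != desired[name]:
            changes.append(("update", name))
    # Anything running that is no longer declared gets removed.
    for name in sorted(actual):
        if name not in desired:
            changes.append(("delete", name))
    return changes


desired = {"api": "v2", "web": "v1"}
actual = {"api": "v1", "worker": "v1"}
print(plan_changes(desired, actual))
# [('update', 'api'), ('create', 'web'), ('delete', 'worker')]
```

Because the desired state is just data, it can live in Git and be reviewed like any other change, which is where the auditability argument comes from.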

SRE Roles

Speaking of arguments, SRE roles were proliferating within our organization. It’s clear that SRE principles are becoming more mainstream as we look for ways to improve reliability and maintainability. But this also means more people dealing with the ops side of things—another reason why Backstage is so important.

The Hacker News of January

If you were following Hacker News, it was a month filled with drama. Robinhood’s stock trading shenanigans were front-page news, along with WhatsApp’s data sharing ultimatum. Then there was the Element chat app being banned from Google Play, and of course, the U.S. Capitol protests. It all felt like a wild ride.

My Take

In the tech world, it’s easy to get caught up in these big stories, but for us, it’s really about making our infrastructure more resilient and reliable. Whether we’re debugging eBPF programs or scaling Backstage, every day is filled with challenges that keep things interesting.

January 2021 was a month of infra messes and microservices mayhem. But hey, at least I had a good excuse for missing a few meetings—my computer was busy dealing with all the eBPF crashes!