$ cat post/strace-on-the-wire-/-a-shell-history-of-years-/-a-segfault-in-time.md

strace on the wire / a shell history of years / a segfault in time


When SRE Meets Platform Engineering and Pandemic FOMO


March 15, 2021

It’s been a strange few months. The tech industry was buzzing about platform engineering, internal developer portals like Backstage were gaining traction, and everyone was wrestling with the complexities of running services on Kubernetes. Then, of course, there was the remote-first work boom driven by the pandemic.

I remember one morning, I had just finished my Zoom call with the team when a Slack notification popped up from our SRE channel. It was yet another alert about a misbehaving service on our production cluster. As someone who wears both an engineering and platform hat, I found myself in the familiar yet frustrating position of trying to balance day-to-day ops work with long-term platform improvements.

We were using ArgoCD for our GitOps setup, which had been great at keeping our services in sync with what was in Git, but it wasn’t without its quirks. One particularly stubborn service kept failing to sync and redeploy after a code change. It was one of those “stuck” issues that make you feel like you’re scratching an itch that’s just out of reach.
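For readers who haven’t fought this particular fight: ArgoCD’s sync behavior is driven by the Application manifest, and a self-heal-plus-retry policy is the usual first knob to reach for when a sync keeps flapping. A minimal sketch, with the app name, repo URL, and paths invented purely for illustration (this is not our actual config):

```yaml
# Hypothetical ArgoCD Application -- name, repoURL, and paths are
# placeholders for illustration only.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: stubborn-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/deploy-repo.git
    targetRevision: main
    path: services/stubborn-service
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # re-sync when live state drifts from Git
    retry:
      limit: 5         # give up after five failed sync attempts
      backoff:
        duration: 10s  # initial wait between retries
        factor: 2      # exponential backoff multiplier
        maxDuration: 3m
```

With `selfHeal` on, manual drift gets reverted automatically, which can also mask the real cause of a flapping sync — part of why the issue felt so slippery.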

I spent hours digging through logs, running kubectl commands, and even tried some eBPF magic (it’s always good to keep your toolbelt stocked with tricks). But for a while, it felt like I was chasing my tail. The service would go from red to green, only to turn back to red after a few minutes. It was infuriating.

Finally, in a moment of desperation, I decided to take a step back and look at the broader context. This issue wasn’t just about one service; it was part of a larger challenge we were facing as we scaled our operations in a remote-first environment. The infrastructure needed more robust monitoring, better automation, and clearer separation between development and production.

That’s when I had an epiphany: perhaps the solution lay not just with my current setup but also in aligning SRE practices more closely with platform engineering goals. We needed to build a more resilient foundation that could handle both our growing codebase and the increasing complexity of managing services at scale.

I started brainstorming ideas for how we could refactor some of our existing processes. I proposed moving our GitOps tooling to Flux CD, which had been gaining traction in the community as a more robust alternative to ArgoCD. It would give us better visibility into what was happening with each deployment and let us automate more of our pipeline.
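The core of a Flux setup is a `GitRepository` source paired with a `Kustomization` that reconciles it on an interval. A rough sketch of what a pilot like ours might look like — the repo URL, names, paths, and intervals below are placeholders, not our real manifests:

```yaml
# Hypothetical Flux CD resources -- URL, names, and paths are
# illustrative placeholders.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: deploy-repo
  namespace: flux-system
spec:
  interval: 1m                 # how often to poll Git for changes
  url: https://example.com/org/deploy-repo.git
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: services
  namespace: flux-system
spec:
  interval: 10m                # full reconcile cadence
  sourceRef:
    kind: GitRepository
    name: deploy-repo
  path: ./services             # directory of manifests to apply
  prune: true                  # garbage-collect removed resources
  timeout: 2m
```

The appeal for us was that reconciliation state lives in ordinary Kubernetes resources you can inspect with `kubectl`, rather than behind a separate UI.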

After much discussion and some healthy debate within the team (yes, it’s happened to me too, folks), we decided to give Flux CD a try. We set up an initial pilot project where I could test out these changes in a controlled environment before rolling them out across the board. The results were promising; not only did it help us resolve that pesky stuck service issue, but it also made our overall deployment process more reliable.

Meanwhile, the news about the Ubiquiti breach and other cybersecurity incidents kept me on edge. We had been fortunate to avoid any major breaches or data leaks, but I couldn’t ignore the growing awareness of security best practices in every part of our tech stack. It reminded me that while we were making progress, there was always more to do.

And then, just as I was starting to feel a bit settled into this new balance between platform and ops, along came the GitHub name change saga. I chuckled to myself, thinking about how quickly we forget about the day-to-day struggles in tech when something big shakes things up.

But hey, isn’t that what makes our jobs interesting? The constant learning, the challenges, and the moments of breakthroughs. As someone who straddles both engineering and platform worlds, I found this period particularly rich with opportunities to grow and improve.

So here’s to March 15, 2021—another day in tech, full of frustrations, learnings, and a sprinkle of FOMO as we all grapple with the ever-evolving landscape of platforms, SRE practices, and the daily grind of making things work.

Until next time,
Brandon