$ cat post/the-swap-filled-at-last-/-the-database-was-the-truth-/-we-were-on-call-then.md

the swap filled at last / the database was the truth / we were on call then


Title: Remote First: The Pandemic’s Impact on DevOps and Platform Engineering


March 16, 2020. Today marks a turning point, one that will be inscribed in the history of technology and how we work. The world is shifting to remote-first work, and it hits close to home as my team at Workday begins that journey.

Just last week, I was flying from LA to Seattle for an internal conference. Now, our platform engineering team has gone virtual overnight. Zoom meetings are the new norm, and we’re scrambling to support a rapidly growing number of remote workers. The infrastructure is struggling; everyone’s home Wi-Fi and laptops aren’t meant for this level of load.

One of the things I’m wrestling with right now is ensuring that our internal developer portal (Backstage) remains robust despite the spike in traffic from suddenly remote engineers. We’ve been using Backstage to catalog our services and manage the APIs they expose, but we need to make sure it stays performant as more people come to rely on it.
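One of the ideas on the table is caching repeated catalog lookups so the backend isn’t hammered by hundreds of engineers requesting the same entities. Here’s a rough sketch of that idea in Python (Backstage itself is TypeScript, and `fetch_component`, `CATALOG`, and the entry data are all hypothetical names for illustration, not Backstage’s actual API):

```python
# Generic sketch of memoizing expensive catalog lookups.
# All names and data here are made up for illustration.
from functools import lru_cache

CATALOG = {"payments-api": {"owner": "team-payments", "lifecycle": "production"}}
CALLS = {"count": 0}

@lru_cache(maxsize=1024)
def fetch_component(name: str) -> tuple:
    """Pretend this hits the catalog backend; repeat lookups are cached."""
    CALLS["count"] += 1
    entry = CATALOG[name]
    return (entry["owner"], entry["lifecycle"])

# 100 engineers asking for the same entity...
for _ in range(100):
    fetch_component("payments-api")

print(CALLS["count"])  # ...but the backend was hit only once
```

The trade-off, of course, is staleness: a cache in front of the catalog means ownership changes take a while to show up, so any real version would need a TTL or invalidation hook.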

Yesterday, I spent a good portion of my day debugging a performance issue with our Kubernetes cluster. It seemed like every pod was hitting its CPU limits, even though the load wasn’t much higher than it would be in normal circumstances. After some digging, I realized that we were experiencing eBPF-induced bottlenecks. A recent update to one of our services had introduced an eBPF program that was too aggressive with packet filtering, and it was eating up a lot of CPU.
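The tell-tale sign was CPU throttling: the kernel exposes per-cgroup counters in `cpu.stat`, and the ratio of throttled periods to total periods shows how often a container wanted CPU but was capped at its limit. A minimal sketch of reading those counters (the sample numbers below are made up; the real file lives under `/sys/fs/cgroup/`):

```python
# Sketch: estimating how often a container is CPU-throttled from
# cgroup v1 cpu.stat counters. Sample contents are hypothetical.

def throttle_ratio(cpu_stat_text: str) -> float:
    """Return the fraction of scheduler periods in which the cgroup
    was throttled, given the contents of a cpu.stat file."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats["nr_throttled"] / periods

# Made-up sample of what cpu.stat might contain:
sample = "nr_periods 1000\nnr_throttled 420\nthrottled_time 9000000000"
print(f"throttled in {throttle_ratio(sample):.0%} of periods")  # 42%
```

A throttle ratio that high on a service that isn’t actually busy is a strong hint that something inside the pod, like an overly aggressive filter, is burning CPU on every packet.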

I had argued against implementing eBPF for this particular service, warning about the potential for unintended performance hits. It’s a classic case of overengineering, a common pitfall in our industry. I reminded myself that simplicity is often better than complexity, especially when scaling remote-first infrastructure where network conditions are unpredictable.

Another challenge is the shift towards GitOps with tools like ArgoCD and Flux. We’ve been using ArgoCD for some time now, but it’s never faced a situation like this. Our deployments are more frequent and more distributed, which means our GitOps practices need to be stronger than ever. I’m seeing firsthand how crucial reliable automation is when things get busy or chaotic.
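What keeps me calm is that the core GitOps idea is simple: desired state lives in Git, and an agent continuously converges the cluster toward it. A toy reconcile loop makes the idea concrete (all names and replica counts here are invented; this is the concept, not how ArgoCD is actually implemented):

```python
# Toy reconcile loop illustrating the GitOps idea: compare desired state
# (from Git) against actual state and emit converging actions.
# All apps and counts are made up for illustration.

desired = {"web": 4, "worker": 2}                 # replica counts in Git
actual = {"web": 4, "worker": 1, "old-job": 1}    # what the cluster runs

def reconcile(desired: dict, actual: dict) -> dict:
    """Return the actions needed to converge actual onto desired."""
    actions = {}
    for app, replicas in desired.items():
        if actual.get(app) != replicas:
            actions[app] = f"scale to {replicas}"
    for app in actual:
        if app not in desired:
            actions[app] = "delete"   # prune resources no longer in Git
    return actions

print(reconcile(desired, actual))
# {'worker': 'scale to 2', 'old-job': 'delete'}
```

Because the loop runs continuously, a botched manual change or a drifted cluster heals itself on the next pass, which is exactly the property you want when everyone is deploying from their living rooms.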

The news from Hacker News today is a stark reminder of the real-world implications of our work. Stories like Zoom’s privacy issues and Honda removing touchscreens highlight the critical importance of security, user experience, and ethical considerations in technology. As we scale our infrastructure to support remote teams, these aspects become even more paramount.

We’re also grappling with the broader pandemic context. The WHO declared it a global pandemic last week, sending ripples through the tech industry even as npm announces it is joining GitHub. It’s a testament to how interconnected our world is and how quickly events can reshape the landscape of technology and engineering.

As I write this, the team is still figuring things out. We’re learning as we go, adjusting to new challenges in real time. But one thing remains constant: the importance of resilience and adaptability in platform engineering. The shift to remote-first infrastructure may be uncomfortable at first, but it’s pushing us to rethink how we design systems that can handle unpredictability.

In the coming weeks, I’ll be reflecting more on these changes. How do they affect our approach to DevOps? What new tools or practices might emerge from this experience? And most importantly, how do we ensure that despite all the chaos, our infrastructure remains reliable and secure for everyone?

For now, it’s about keeping the lights on while we navigate these uncharted waters.