$ cat post/february-25,-2019---sre's-dilemma:-scaling-remote-infra-in-a-pandemic.md

25FEB19

February 25, 2019 - SRE's Dilemma: Scaling Remote Infra in a Pandemic

Hey there,

February 2019 finds me knee-deep in the chaos of remote-first infrastructure scaling. It’s been a rollercoaster since the pandemic hit, but I’ve got some concrete examples and experiences to share.

The Setup

We’ve seen a significant shift towards remote work, with our company going from a mix of office and home to fully remote. As a result, our internal developer portal, Backstage, is now handling more requests than ever. Our platform team is growing, but we’re still juggling the complexities that come with scaling infrastructure remotely.

The Debugging Sessions

Last week, I was deep into a debugging session for one of our services running on Kubernetes. We were seeing intermittent 502 errors, and I had to figure out what was causing them. After hours of digging through logs and metrics, I traced the problem to the way our eBPF programs handle network connections: under high load, connections were timing out, and those timeouts surfaced as 502s. It’s a classic case of over-engineering meeting unexpected edge cases.
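To make the log-digging part concrete, here is a minimal sketch of one way to check whether 502s cluster in time, which is often the first hint that load is involved. The log format (ISO timestamp, status code, upstream response time in seconds) and the sample lines are assumptions made up for illustration, not our actual access logs.

```python
from collections import Counter
from datetime import datetime


def count_502s_per_minute(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH:MM' to the number of 502 responses."""
    buckets = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip lines that do not have a timestamp and a status
        timestamp, status = parts[0], parts[1]
        if status != "502":
            continue
        try:
            minute = datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H:%M")
        except ValueError:
            continue  # skip lines whose first field is not an ISO timestamp
        buckets[minute] += 1
    return buckets


# Hypothetical sample: "<iso-timestamp> <status> <upstream-seconds>"
SAMPLE_LOG = [
    "2019-02-18T03:14:07 200 0.012",
    "2019-02-18T03:14:31 502 5.001",
    "2019-02-18T03:14:58 502 5.000",
    "2019-02-18T03:15:02 200 0.020",
    "2019-02-18T03:16:40 502 5.002",
    "not-a-log-line",
]

if __name__ == "__main__":
    for minute, count in sorted(count_502s_per_minute(SAMPLE_LOG).items()):
        print(minute, count)
```

If the busy minutes line up with traffic spikes on your dashboards, timeouts under load become a much more likely suspect than a bad deploy.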

The GitOps Evolution

Meanwhile, our team is also grappling with GitOps practices using ArgoCD and Flux. These tools are still maturing, but they’re becoming indispensable for managing our complex infrastructure. Recently, we hit a snag where Flux was failing to apply changes because of a misconfigured Helm chart. After a bit of troubleshooting, I traced the misconfiguration to an outdated dependency in one of our chart files. Once that was updated, changes started applying cleanly again.
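A check like the one below can catch this kind of drift before Flux does. It is only a sketch under stated assumptions: it takes the `dependencies` lists from a chart's Chart.yaml and Chart.lock as already-parsed Python dicts, it assumes exact versions rather than semver ranges, and the function name and sample charts are hypothetical.

```python
def find_stale_dependencies(declared, locked):
    """Compare declared chart dependencies against the pinned ones.

    Both arguments are lists of dicts with "name" and "version" keys,
    mirroring the `dependencies` entries in Chart.yaml and Chart.lock.
    Returns a list of human-readable problem descriptions.
    """
    locked_versions = {dep["name"]: dep["version"] for dep in locked}
    problems = []
    for dep in declared:
        name, wanted = dep["name"], dep["version"]
        pinned = locked_versions.get(name)
        if pinned is None:
            problems.append(f"{name}: declared {wanted} but missing from lock file")
        elif pinned != wanted:
            problems.append(f"{name}: declared {wanted} but lock file pins {pinned}")
    return problems


# Hypothetical example data, not taken from a real chart.
DECLARED = [
    {"name": "redis", "version": "6.0.1"},
    {"name": "postgresql", "version": "3.9.5"},
]
LOCKED = [
    {"name": "redis", "version": "5.2.0"},
]

if __name__ == "__main__":
    for problem in find_stale_dependencies(DECLARED, LOCKED):
        print(problem)
```

When a check like this reports a mismatch, running `helm dependency update` in the chart directory regenerates the lock file from the declared dependencies.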

The Pandemic’s Impact

The pandemic has only amplified these challenges. With everyone working from home, the lines between work and personal life have blurred. I’ve found myself staying up late to fix issues that pop up, often at 3 AM when my family is trying to sleep. It’s not always easy to maintain a good work-life balance in this setup.

Reflecting on My Failure to Build a Billion-Dollar Company

Speaking of challenges and growth, I recently read the HN post “Reflecting on My Failure to Build a Billion-Dollar Company” (2101pts, 353 comments). It resonated with me because it’s a reminder that success isn’t just about building the next unicorn. Sometimes, it’s more about the journey and learning from those challenges.

Lessons Learned

Remote Work Requires More Attention: Building a remote-first infrastructure isn’t just about moving servers to AWS; it’s about rethinking how you handle monitoring, debugging, and deploying.
Evolving Tools Are Essential: eBPF and GitOps tools like ArgoCD and Flux are crucial for managing complex systems, but they require continuous learning and adaptation.
Personal Well-being Matters: Balancing personal life with remote work is key. Setting boundaries can help prevent burnout.

Looking Forward

As we move into March, I’m looking forward to continuing this journey of scaling our infrastructure while maintaining a healthy balance. The tools and practices are evolving rapidly, and it’s exciting to be part of that evolution.

Stay tuned for more updates as the story unfolds!

Cheers, Brandon