
20SEP21

September 20, 2021: A Tale of Scaling Remote Infrastructure in the Pandemic

It feels like just yesterday we were all huddled together, trying to figure out how to keep our servers running amid the chaos of a global pandemic. Now that remote work has become the norm, it’s time to reflect on what we’ve learned and where things stand.

This month, I spent a lot of time working on our platform at Scaleway, focusing on scaling our infrastructure to support more remote engineers. It’s funny how much a good old-fashioned ops issue can feel like a personal struggle when you’re juggling everything from home.

One of the biggest challenges we faced was keeping our internal developer portal up and running as more developers shifted their work online. We’ve been using Backstage for quite some time, but with everyone working from home, the load on our servers spiked way beyond what we anticipated. Our initial setup was no longer enough to handle the increased traffic without significant downtime.

I spent a few days digging into the logs and metrics to understand where the bottlenecks were. It turned out that DNS resolution was becoming an issue: lookups were slowing down and ping times to some of our services started creeping up, which made the UI feel sluggish for developers. This was especially frustrating because our network team assured me everything was fine on their end.
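
If you want to check whether name resolution itself is the slow part, timing lookups directly is a reasonable first step. Here is a minimal sketch using only the Python standard library. The function names (`time_lookup`, `summarize`, `probe`) are my own for illustration, and the `resolve` parameter exists only so the lookup can be swapped out in tests. It goes through the system resolver, so it measures what an application would see rather than the behavior of one specific DNS server.

```python
import socket
import statistics
import time
from typing import Callable, Dict, List


def time_lookup(host: str, resolve: Callable = socket.getaddrinfo) -> float:
    """Return how long a single name lookup takes, in milliseconds."""
    start = time.perf_counter()
    resolve(host, None)  # system resolver path, same as most applications use
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Reduce a list of latency samples to a median and a worst case."""
    return {
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }


def probe(host: str, attempts: int = 5,
          resolve: Callable = socket.getaddrinfo) -> Dict[str, float]:
    """Time several lookups of the same host and summarize them."""
    return summarize([time_lookup(host, resolve) for _ in range(attempts)])
```

Calling `probe("portal.internal.example")` (a hypothetical hostname) returns a small dictionary with the median and maximum latency. Keep in mind that a local resolver cache can make every lookup after the first one look fast, so the maximum is often the more telling number.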

I started looking into eBPF to see whether it could help us optimize the networking stack further. It’s amazing how much you can do with just a few lines of code that hook into the kernel. After some experimenting, I managed to reduce DNS resolution times by about 20%, which helped stabilize our portal’s performance significantly.

But performance wasn’t the only issue. We also had to deal with increased demand on our databases and storage systems. As more developers started relying on our platform for CI/CD pipelines, we saw a surge in API calls. Our Postgres instances were hitting their limits during peak hours, leading to timeouts and slow queries. To address this, I set up monitoring alerts that fire early enough for us to scale out the database cluster before those limits are reached.
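
The idea behind that kind of alert can be sketched in a few lines. This is an illustration under stated assumptions, not our actual alerting config: it assumes the signal is connection utilization (connections in use divided by the configured maximum), and the 80% threshold and three-sample window are made-up defaults. Requiring several consecutive readings above the threshold avoids firing on a single spike.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScaleRule:
    threshold: float = 0.80      # assumed: fraction of max connections in use
    sustained_samples: int = 3   # assumed: consecutive readings required


def should_scale_out(utilization: Sequence[float],
                     rule: ScaleRule = ScaleRule()) -> bool:
    """True when the most recent readings all exceed the threshold."""
    if len(utilization) < rule.sustained_samples:
        return False  # not enough data to call it a sustained trend
    recent = utilization[-rule.sustained_samples:]
    return all(u > rule.threshold for u in recent)
```

With the defaults, `should_scale_out([0.5, 0.85, 0.9, 0.95])` is true because the last three readings are all above 0.80, while `should_scale_out([0.9, 0.95, 0.7])` is false because the most recent reading dropped back down.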

One of the most interesting pieces of tech I’ve been exploring is ArgoCD. We decided to give it a try for managing our Kubernetes deployments across multiple environments. Initially, there were some growing pains as we worked through how best to integrate it with Flux and our existing infrastructure. However, once we got everything set up, we saw significant improvements in how quickly changes could be rolled out across our clusters without downtime.

Alongside all this, I’ve been keeping an eye on the Kubernetes complexity fatigue that seems to be setting in for many teams. With every new version comes a flood of shiny features and breaking changes. It’s easy to get caught up in the hype cycle and forget about the basics—like making sure your storage classes are properly configured or ensuring that you’re using secrets management best practices.

Speaking of which, I’ve been thinking a lot about security lately. With more developers working from home, we needed to ensure our CI/CD pipelines were secure and isolated enough that any missteps wouldn’t compromise the entire platform. We started implementing more stringent access controls and reviewed all our pipeline configurations to make sure they weren’t leaving gaps.

And then there’s the legal stuff, like the Epic vs. Apple case. As someone who deals with GDPR compliance, I couldn’t help but be curious about how this would play out. It triggered a bit of an existential crisis for some of us, as we wondered whether our work was going to become too complex and unwieldy.

All in all, it’s been a whirlwind month—lots of debugging, arguing, and learning. But that’s what keeps things interesting. As we continue to scale our infrastructure and support more remote workers, I’m excited about where this will lead us next. Whether it’s optimizing DNS resolution times or implementing better secrets management, every challenge is an opportunity for growth.

Until next time,

Brandon