$ cat post/real-talk-on-scaling-remote-infrastructure-in-a-post-pandemic-world.md

Real Talk on Scaling Remote Infrastructure in a Post-Pandemic World


October 25th, 2021. A date that marks another milestone in the tech industry’s ongoing evolution. I’m sitting at my home office, a corner desk with a view of a not-so-impressive neighborhood park, and reflecting on what it means to manage infrastructure for developers who are working from home offices just like this one.

The Context:

Platform engineering was really coming into its own during these months. Internal developer portals like Backstage were becoming the norm, and SRE roles were proliferating more than ever before. COVID had pushed everyone to remote work, driving a need for robust, scalable infrastructure that could support our growing dev teams spread across continents.

In my role as an engineering manager, I found myself wrestling with Kubernetes complexity fatigue. The promise of container orchestration was there, but the reality of managing it at scale became increasingly daunting. GitOps tools like Argo CD and Flux were helping, but they came with their own complexities and needed a lot of fine-tuning to get right.

The Problem:

One recent Monday morning, my team reported that our internal developer portal, Backstage, was down. It wasn’t the first time we had seen this, but it always feels like an indictment of everything you’ve done so far. The outage wasn’t just about a service failing; it was a reminder that our infrastructure was still fragile.




The root cause turned out to be a misconfiguration in our Kubernetes cluster. A simple typo in the YAML file had caused a critical pod to fail, and since we didn’t have robust monitoring in place, no one noticed until it was too late. The fix wasn’t complicated—just fixing the typo and ensuring better logging—but the process of identifying and resolving the issue highlighted the importance of having reliable monitoring and alerting systems.
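The kind of typo described above can sometimes be caught before a manifest ever reaches the cluster. The post doesn't say what the typo actually was, so the following is only a minimal sketch, assuming it was a misspelled key and that the manifest has already been parsed into a Python dict. `KNOWN_KEYS` and `find_suspect_keys` are hypothetical names for illustration, and the key list is a tiny, deliberately incomplete subset of real Kubernetes fields.

```python
import difflib

# Hypothetical, deliberately incomplete allow-list of manifest keys.
KNOWN_KEYS = {
    "apiVersion", "kind", "metadata", "name", "namespace", "labels",
    "annotations", "spec", "replicas", "selector", "matchLabels",
    "template", "containers", "image", "ports", "containerPort",
}

# Children of these keys are user-defined names, so they are not checked.
FREEFORM_PARENTS = {"labels", "matchLabels", "annotations"}


def find_suspect_keys(node, path=""):
    """Return (path, suggestion) pairs for keys that are not in KNOWN_KEYS.

    The suggestion is the closest known key, or None if nothing is similar.
    """
    findings = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else key
            if key not in KNOWN_KEYS:
                close = difflib.get_close_matches(
                    key, sorted(KNOWN_KEYS), n=1, cutoff=0.8
                )
                findings.append((here, close[0] if close else None))
            if key not in FREEFORM_PARENTS:
                findings.extend(find_suspect_keys(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            findings.extend(find_suspect_keys(item, f"{path}[{index}]"))
    return findings
```

Run as a pre-merge check, a manifest with `conatiners` in place of `containers` would be flagged along with a suggested correction. A real setup would more likely validate against the full Kubernetes schema than against a hand-written list like this one.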

The Hacker News Inspiration:

Reading through Hacker News that week, I couldn’t help but see parallels between the stories and my own challenges. The outage that took down the Facebook-owned sites was a stark reminder of how fragile large-scale infrastructure can be. IoT hacking and rickrolling might seem trivial, but they underscored the importance of security at all levels. And the MacBook Pro 14-inch and 16-inch saga? Well, that just made me smile as I typed away on my own MacBook Air.

One article that really resonated with me was “Things I’ve learned in my 20 years as a software engineer.” It’s easy to get caught up in the latest tech trends and forget about the basics. As someone who has been in this game for over two decades, it’s humbling to see how much the industry has evolved but also how some fundamental principles remain constant.

The Lessons Learned:

This incident taught me a few things:

  1. Robust Monitoring is Crucial: We need better monitoring and alerting systems to catch issues before they impact users.
  2. Resilience Matters: Our infrastructure needs to be more resilient, able to handle unexpected failures without causing widespread outages.
  3. Documentation is Key: Clear documentation of our configurations will help prevent simple mistakes like the typo we had.
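
To make the first lesson concrete, here is a minimal sketch of the alerting logic we were missing: probe a health endpoint and raise an alert only after several consecutive failures, so a single blip doesn't page anyone. Everything here is an assumption for illustration. `http_probe` and `FailureTracker` are hypothetical names, the threshold of three is arbitrary, and the URL to probe would be whatever health endpoint your portal exposes.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


class FailureTracker:
    """Signal an alert once after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, healthy):
        """Record one probe result; return 'alert', 'recovered', or None."""
        if healthy:
            was_alerting = self.alerting
            self.consecutive_failures = 0
            self.alerting = False
            return "recovered" if was_alerting else None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"
        return None
```

In practice you would call `tracker.record(http_probe(url))` on a timer and forward any `"alert"` or `"recovered"` result to your paging or chat tool. A dedicated monitoring stack does this far better, but the consecutive-failure idea is the same.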

As I sit here writing this, reflecting on what went wrong and how we can do better, I’m reminded that every day in tech involves a mix of excitement, frustration, and hard-earned wisdom. The era of platform engineering, SRE roles, and remote work continues to shape our industry, and it’s up to us to adapt and thrive.

Stay tuned for more real talk on the journey of building robust, scalable infrastructure in a world that keeps changing at lightning speed.

