$ cat post/the-floppy-disk-spun-/-the-database-was-the-truth-/-the-shell-recalls-it.md

the floppy disk spun / the database was the truth / the shell recalls it


Title: On Scaling Infrastructure for Remote Work Amidst Kubernetes Complexity Fatigue


October 21, 2019 has become a bit of a landmark in my career: the day I found myself wrestling with the complexities of our Kubernetes cluster as we began shifting towards remote-first operations, a shift that COVID-19 would soon turn from an experiment into the default. It was an era in which platform engineering was formalizing and SRE roles were proliferating, but also one where the sheer complexity of managing Kubernetes clusters began to weigh on us.

The Setup

We had been running a fairly robust Kubernetes cluster in our main data center. The idea was straightforward: containers for everything, so we could spin up new instances quickly and scale as needed. However, as the number of engineers grew and the applications we deployed became more complex, maintaining the cluster started to feel like herding cats.
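
In practice, "spin up quickly and scale" mostly meant poking at Deployments. Something like the sketch below, using the official kubernetes Python client, captures the shape of it; the deployment name and namespace are placeholders for the example, not our real services.

```python
# A minimal sketch of scaling a Deployment with the official
# kubernetes Python client. Names and namespace are hypothetical.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()
    # Merge-patch the scale subresource to the desired replica count.
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

if __name__ == "__main__":
    # e.g. bump a (hypothetical) api-gateway to 5 replicas
    scale_deployment("api-gateway", "default", 5)
```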

The Shift to Remote

The shift to remote work brought an added layer of complexity. Suddenly, all our infrastructure had to be reachable from any home network, most of them subpar compared to our office-grade connections. That pushed us towards a proper internal developer portal, and we adopted Backstage to help manage and document everything. But with our Backstage instance still in its early stages, documentation was often thin, which made it harder for new engineers to onboard.

Kubernetes Complexity Fatigue

As we tried to scale our cluster to meet the needs of remote work, I found myself staring down a mountain of issues. Kubernetes itself was already complex enough; adding the nuances of remote networking made things even more challenging. We were grappling with service mesh configurations, network policies, and just general Kubernetes best practices.
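
Network policies were one of the sharper edges for me. As a rough illustration of the kind of rule we kept writing, here is a sketch using the kubernetes Python client; the app labels and namespace are made up for the example, not our actual topology.

```python
# Sketch: allow ingress to "payments" pods only from "frontend" pods.
# Labels, policy name, and namespace are hypothetical.
from kubernetes import client, config

def lock_down_payments(namespace: str = "default") -> None:
    config.load_kube_config()
    policy = client.V1NetworkPolicy(
        metadata=client.V1ObjectMeta(name="payments-allow-frontend"),
        spec=client.V1NetworkPolicySpec(
            pod_selector=client.V1LabelSelector(match_labels={"app": "payments"}),
            policy_types=["Ingress"],
            ingress=[
                client.V1NetworkPolicyIngressRule(
                    _from=[
                        client.V1NetworkPolicyPeer(
                            pod_selector=client.V1LabelSelector(
                                match_labels={"app": "frontend"}
                            )
                        )
                    ]
                )
            ],
        ),
    )
    client.NetworkingV1Api().create_namespaced_network_policy(namespace, policy)
```

The tricky part was never writing one policy; it was keeping dozens of them consistent as services moved around.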

One particularly frustrating day, we experienced an outage where certain services were intermittently unreachable from some locations. It turned out to be a DNS resolution issue due to differences in how various home routers handled DNS queries. After several hours of digging through logs and trying different workarounds, I realized the need for a more robust monitoring solution. We started looking into Grafana and Prometheus to better track these kinds of issues.
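
For what it's worth, the probe we wished we'd had that day looks roughly like this: resolve an internal hostname against a couple of resolvers and expose failures as a Prometheus counter that Grafana can graph. This is only a sketch, assuming the dnspython and prometheus_client libraries; the hostname and resolver addresses are illustrative.

```python
# Sketch of a DNS probe exporter: resolve a hostname against several
# resolvers and count failures per resolver as a Prometheus metric.
# Hostname and resolver IPs are placeholders.
import time

import dns.resolver                       # pip install dnspython
from prometheus_client import Counter, start_http_server

DNS_FAILURES = Counter(
    "dns_probe_failures_total",
    "DNS lookups that failed or timed out",
    ["resolver", "hostname"],
)

RESOLVERS = ["8.8.8.8", "1.1.1.1"]        # e.g. a home router vs. a public resolver
HOSTNAME = "internal.example.com"         # hypothetical internal service name

def probe() -> None:
    for ip in RESOLVERS:
        resolver = dns.resolver.Resolver()
        resolver.nameservers = [ip]
        resolver.lifetime = 2.0           # fail fast instead of hanging
        try:
            resolver.resolve(HOSTNAME, "A")
        except Exception:
            DNS_FAILURES.labels(resolver=ip, hostname=HOSTNAME).inc()

if __name__ == "__main__":
    start_http_server(9100)               # scrape target for Prometheus
    while True:
        probe()
        time.sleep(30)
```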

SRE Practices

Alongside the monitoring work, we were also ramping up our SRE practices. The idea was to shift from purely operational firefighting to making sure our systems could handle unexpected load or failure without breaking. That meant more automation, better documentation, and a culture shift towards treating infrastructure as code.
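
One concrete example of what that shift looked like: turning an availability target into an error budget you can actually burn down. The numbers below are made up, but the arithmetic is the part that matters.

```python
# Sketch: turn an availability SLO into an error budget and check burn.
# The SLO target and request counts are made-up numbers for illustration.

def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> str:
    allowed_failures = (1.0 - slo) * total_requests          # budget, in requests
    burned = failed_requests / allowed_failures if allowed_failures else float("inf")
    availability = 1.0 - failed_requests / total_requests
    return (
        f"availability={availability:.4%} "
        f"budget={allowed_failures:.0f} failures "
        f"burned={burned:.1%} of budget"
    )

if __name__ == "__main__":
    # e.g. a 99.9% SLO over 10M requests with 4,200 failures
    print(error_budget_report(0.999, 10_000_000, 4_200))
```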

One of the tools we began experimenting with was eBPF (Extended Berkeley Packet Filter). It seemed like a promising way to gain deeper insights into system performance without the overhead of traditional monitoring solutions. We had to learn how to use it properly, but it promised a lot in terms of fine-grained control and real-time data collection.
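
Here is a minimal sketch of the kind of experiment we ran, using the bcc Python bindings to count TCP retransmissions from a kernel probe. It assumes bcc is installed and needs root plus a kernel with eBPF support; treat it as a toy, not the exact tooling we ended up with.

```python
# Sketch: count TCP retransmissions kernel-side with eBPF via bcc.
# Requires root and the bcc toolkit; overhead is negligible because
# the counting happens in the kernel, not in this Python loop.
import time
from bcc import BPF

PROG = r"""
BPF_HASH(retransmits, u32, u64);

int trace_retransmit(struct pt_regs *ctx) {
    u32 key = 0;
    retransmits.increment(key);   // bump the single shared counter
    return 0;
}
"""

b = BPF(text=PROG)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

print("Counting TCP retransmits, Ctrl-C to stop...")
while True:
    time.sleep(5)
    for _, value in b["retransmits"].items():
        print(f"retransmits so far: {value.value}")
```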

The Journey Continues

As we continue to navigate these challenges, it’s clear that there’s no one-size-fits-all solution. Each day brings new lessons and trade-offs. While Kubernetes is powerful, its complexity can sometimes feel overwhelming. SRE practices are crucial for maintaining a resilient system, but they require significant effort.

The past few months have been a learning curve—both in terms of technical skills and in understanding the human element of managing distributed teams. The era we’re living through has forced us to adapt quickly and be more mindful of the tools and practices we choose.

Reflection

In the end, it’s about finding that balance between flexibility and control, automation and manual intervention. We’re not yet where we want to be, but each step forward brings us closer to a more robust and resilient infrastructure. The journey continues, one issue at a time.


This is just how I’ve been seeing things lately. There are certainly more ups and downs ahead, but that’s part of the fun, right?