December 21, 2020: A Year of Remote Chaos and Platform Engineering
Today feels like an intersection of several worlds. On one side, the ongoing pandemic is pushing companies to adapt and scale their remote work infrastructure. On the other, platform engineering is becoming more formalized, with tools like Backstage starting to take hold. Meanwhile, Kubernetes complexity fatigue is setting in as teams wrestle with the day-to-day operations of running clusters.
I woke up early this morning, feeling the weight of a month filled with technical challenges and team-wide discussions. We have been hitting some rough patches with our platform, but we are also making significant progress on new projects.
Debugging Kubernetes Cluster Issues
One of the biggest headaches we faced was related to persistent issues in our Kubernetes clusters. A few weeks back, I noticed a spike in the number of pods that kept crashing and failing to restart. Digging into the logs, I found a common theme: “context deadline exceeded.”
That message means a request ran past its timeout, which usually points to network connectivity issues or resource constraints. After running some diagnostics, though, it became clear we were dealing with something more complex: an intermittent DNS resolution issue caused by a misconfigured service mesh. Fixing it required coordinating with the networking team and updating our Istio configuration.
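Intermittent DNS failures are easy to miss with a single lookup, so one way to catch them is to resolve the same name repeatedly and count the failures. The snippet below is a minimal sketch of that idea, not the diagnostic we actually ran. The function names `probe_dns` and `failure_rate` are made up for illustration, and the `resolver` and `sleep` parameters exist only so the probe can be exercised without touching the network.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    ok: bool                     # True if the lookup succeeded
    seconds: float               # wall-clock time the lookup took
    error: Optional[str] = None  # resolver error message, if any


def probe_dns(
    hostname: str,
    attempts: int = 20,
    pause: float = 0.5,
    resolver: Callable[..., object] = socket.getaddrinfo,
    sleep: Callable[[float], None] = time.sleep,
) -> List[ProbeResult]:
    """Resolve `hostname` repeatedly and record each outcome.

    Hypothetical helper: `resolver` defaults to socket.getaddrinfo but can
    be swapped for a stub, which is how the sketch stays testable offline.
    """
    results: List[ProbeResult] = []
    for i in range(attempts):
        start = time.perf_counter()
        try:
            resolver(hostname, None)
            results.append(ProbeResult(True, time.perf_counter() - start))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            elapsed = time.perf_counter() - start
            results.append(ProbeResult(False, elapsed, str(exc)))
        if i < attempts - 1:
            sleep(pause)  # spread lookups out to catch intermittent failures
    return results


def failure_rate(results: List[ProbeResult]) -> float:
    """Fraction of lookups that failed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(1 for r in results if not r.ok) / len(results)
```

Running something like `failure_rate(probe_dns("my-service.my-namespace.svc.cluster.local"))` from inside an affected pod (the service name here is a placeholder) gives a rough failure percentage to compare before and after a mesh configuration change.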
It was frustrating to spend so much time on what felt like a simple network issue, but at least it taught me the importance of having good logging and monitoring practices in place.
Embracing Internal Developer Portals
At work, we’ve been experimenting more deeply with internal developer portals, specifically Backstage. The goal is to create a single source of truth for all our infrastructure and applications. This month, I led a discussion on how we could improve the portal’s usability by integrating it with our CI/CD pipelines.
The feedback from developers was mixed, but overall they liked the idea of having everything in one place. We’re still working through some kinks, like ensuring that changes to the Backstage configuration are deployed without disrupting services. It’s a learning process, and every day we see new ways to improve our internal developer experience.
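One kink that lends itself to automation is catching broken catalog entries before they reach the portal. The snippet below is a rough sketch of a check a CI job could run, assuming the entity descriptor has already been parsed into a dictionary. It covers only the common fields of a Backstage `Component` entity (`apiVersion`, `kind`, `metadata.name`, and the `type`, `lifecycle`, and `owner` fields under `spec`). The function name `validate_component` is hypothetical, and this is not a replacement for Backstage's own validation.

```python
from typing import Any, Dict, List

# Fields assumed to be required under `spec` for a Component entity.
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed Component descriptor.

    Hypothetical CI helper: an empty list means the basic checks passed.
    """
    problems: List[str] = []

    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        problems.append("apiVersion should be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        problems.append("kind should be Component")

    metadata = entity.get("metadata")
    if not isinstance(metadata, dict) or not metadata.get("name"):
        problems.append("metadata.name is required")

    spec = entity.get("spec")
    if not isinstance(spec, dict):
        spec = {}
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is required")

    return problems
```

A pipeline step could fail the build whenever the returned list is non-empty, so that a malformed descriptor never gets as far as the portal.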
SRE Roles Evolve
The role of Site Reliability Engineers (SREs) continues to grow in importance. At our company, SREs no longer just handle incidents; they are getting involved in design and development from the beginning. This shift has led to some interesting debates about who owns which parts of the infrastructure.
For example, last week we had a heated discussion on whether the application team or the platform team should be responsible for configuring network policies. In the end, we decided that it was best to have both teams involved so that everyone understands the implications of their decisions.
Remote Work Challenges
Remote work has been challenging, especially when it comes to building and maintaining trust within a distributed team. We’re using tools like Slack and Zoom extensively, but sometimes nothing beats a face-to-face conversation over coffee (or at least a video call where everyone shares a screenshot).
I miss the days of spontaneous hallway conversations that could lead to great ideas. But I also appreciate the flexibility remote work offers. It lets me balance my personal and professional life better, which is something I didn’t realize was possible before.
Looking Forward
As we head into 2021, I’m excited about what lies ahead. eBPF looks like it will continue to gain traction as a powerful tool for performance monitoring and optimization. Argo CD and Flux are maturing nicely, which should make our infrastructure deployments more automated and reliable.
But the most important thing is that we keep learning and growing as a team. The past year has been tough, but it’s also shown us the resilience of technology and people in challenging times.