$ cat post/the-kernel-panicked-a-shell-history-of-years-a-segfault-in-time.md
the kernel panicked / a shell history of years / a segfault in time
Title: Debugging Kubernetes Headaches in a Remote World
July 13th, 2020. I woke up to a world that was already very different from the one we’d known just three months earlier. The initial shock of the pandemic had settled into a new normal, with remote work now standard across most of tech. For me, this meant another day of staring at my monitor in a virtual meeting room, but with a twist: our platform engineering team was facing some serious Kubernetes headaches.
We were working on a project to scale up our internal developer portal using Backstage and ArgoCD. The goal was simple: make it easier for developers to access the tools they needed, reduce friction, and provide them with a consistent experience regardless of where they were working from. But as always, the devil is in the details.
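If you haven’t seen the GitOps side of a setup like this, it’s pleasantly boring. Here’s a minimal sketch of roughly what one of our Argo CD Applications looked like; the repo URL, path, and namespaces are hypothetical stand-ins, not our actual config:

```sh
# A minimal sketch of an Argo CD Application for the portal.
# Repo URL, path, and namespaces are hypothetical stand-ins.
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: backstage
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config  # hypothetical repo
    targetRevision: main
    path: backstage                # chart/manifests for the portal
  destination:
    server: https://kubernetes.default.svc
    namespace: backstage
  syncPolicy:
    automated:
      prune: true     # remove resources that disappear from git
      selfHeal: true  # undo manual drift in the cluster
EOF
```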
The first thing we noticed was that our deployments kept failing mysteriously. It wasn’t until I dug into the logs and pod events that I realized the ingress layer, not Kubernetes itself, was returning 502 Bad Gateway errors on some requests. The culprit? A misconfigured service mesh.
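The triage itself was nothing exotic, just the usual loop of squinting at pods and their sidecars. Roughly this, with hypothetical names and an Istio-style sidecar assumed:

```sh
# The usual triage loop; namespace, pod name, and sidecar are hypothetical.
kubectl -n portal get pods                                 # anything not Ready?
kubectl -n portal describe pod backstage-7d4f9c8b6-x2k4m   # events: restarts, failed probes
kubectl -n portal logs backstage-7d4f9c8b6-x2k4m \
  -c istio-proxy --tail=100                                # read the sidecar, not just the app
kubectl -n portal get events --sort-by=.lastTimestamp | tail -n 20
```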
I had to dig through our configuration files, a mix of raw YAML and Helm templates, to find where things had gone wrong. The service mesh had been bolted on as an afterthought, with little integration testing between teams. It was clear we needed more coordination and standardization around how these components were deployed.
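I won’t pretend to remember the literal bad resource, but it was in this family: a TLS setting on one side of the mesh that disagreed with what the other side expected, so the proxies spoke the wrong protocol to each other and the edge surfaced it as gateway errors. An Istio-flavored reconstruction, names hypothetical:

```sh
# Reconstruction of the class of bug, not the literal resource we had.
# Mesh-wide mTLS was STRICT, but a DestinationRule had forced plaintext
# (tls.mode: DISABLE). The fix was to let the sidecars do mTLS again:
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backstage
  namespace: portal
spec:
  host: backstage.portal.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # was DISABLE; plaintext against STRICT mTLS broke calls
EOF

# istioctl ships an analyzer that would have flagged the mismatch up front:
istioctl analyze -n portal
```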
This led us down the rabbit hole of Kubernetes complexity fatigue. We were deploying microservices, stateful applications, and all sorts of other workloads, each demanding its own kind of expertise. Suddenly our simple deployment tools started to feel inadequate, and we found ourselves arguing about whether to invest more time in making our CI/CD pipelines smarter or just stick with what we had.
One particularly frustrating morning, I was debugging an eBPF issue. It turned out to be a misconfiguration in the cgroups setup; once that was fixed, our probes finally attached and let us trace and optimize some of our services’ performance. The irony didn’t escape me: this niche technology, powerful as it is, added yet another layer of complexity to our infrastructure.
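For the curious, the morning went roughly like this: first confirm which cgroup hierarchy the node actually runs, then get a probe attached. The bpftrace one-liner below is a stock latency histogram, purely illustrative of the kind of tracing we were after:

```sh
# Step 1: figure out which cgroup hierarchy the node is running.
stat -fc %T /sys/fs/cgroup/   # "cgroup2fs" on a unified v2 host, "tmpfs" on v1
mount | grep cgroup           # and what is mounted where

# Step 2: once the cgroup setup was straightened out, probes attached fine.
# Illustrative bpftrace one-liner: histogram of vfs_read latency per command.
sudo bpftrace -e '
kprobe:vfs_read { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ {
  @us[comm] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
```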
Meanwhile, on Hacker News, a massive Twitter hack was making headlines. As someone who deals with security every day, I felt a chill run down my spine. It made me realize how much we take for granted when it comes to the stability and security of these platforms, and it underscored the importance of regular audits and robust security practices.
Another story that caught my attention was about Apple’s 30% cut on refunds. Wearing my SRE hat, I couldn’t help but think about how much friction a policy like that adds to the user experience. It’s a reminder that even in our tech-driven world, human factors and policy decisions can shape a system’s design as much as any technical constraint.
As we navigated through these challenges, it became clear that the era of platform engineering was here to stay. We needed more than just Kubernetes; we needed a robust set of tools and practices to manage the complexity. And with SRE roles proliferating, it felt like everyone in tech was suddenly learning about ops.
In the end, we shipped our internal developer portal with improvements to the CI/CD pipeline and better service mesh integration. It wasn’t glamorous, but it felt good to have made progress. The journey taught me a lot about teamwork, problem-solving, and the challenges of building scalable infrastructure in a remote-first world.
That was my day: a blend of excitement and frustration, debugging Kubernetes, arguing over best practices, and dealing with the realities of working from home. It’s a testament to how much our industry has changed, and to how much work we still have ahead of us to make it better.