$ cat post/the-prod-deploy-froze-/-the-terminal-remembers-me-/-a-segfault-in-time.md

11NOV19

the prod deploy froze / the terminal remembers me / a segfault in time

Title: When DevOps Met SRE Met Platform Engineering

On a typical Tuesday in November 2019, I found myself wrestling with the interplay of DevOps, Site Reliability Engineering (SRE), and platform engineering. It had been an interesting journey up to that point, but there were still plenty of battles to be won.

The Context

Back then, we were just starting to formalize our platform engineering practices, and internal developer portals were a hot topic (Spotify would open-source Backstage a few months later, in early 2020). I remember arguing with my colleagues about whether SRE should become a separate discipline or remain part of the DevOps team. Meanwhile, remote work was becoming the new normal for us, months before the pandemic made it the norm everywhere.

A Bug Hunt

One particularly frustrating day, we were dealing with an outage that was causing our users real pain. It wasn’t straightforward: a classic case of misbehaving state in Kubernetes pods. The logs looked like a scrambled jigsaw puzzle, and the only way forward was brute force. I spent hours tracing back through log entries, trying to find the moment when things went south.
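
That kind of search can be partly mechanized. Here is a minimal sketch, assuming a hypothetical log format of `<ISO timestamp> <LEVEL> <message>`; our real logs were messier, and the function name `first_error` is made up for illustration.

```python
from datetime import datetime


def first_error(lines):
    """Return (timestamp, line) for the earliest ERROR/FATAL entry, or None.

    Assumes each line looks like "<ISO timestamp> <LEVEL> <message>".
    Lines that do not match are skipped, since interleaved pod logs
    often contain partial or malformed entries.
    """
    earliest = None
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # not enough fields to be a log entry
        stamp, level, _message = parts
        if level not in ("ERROR", "FATAL"):
            continue
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # timestamp did not parse; ignore the line
        if earliest is None or when < earliest[0]:
            earliest = (when, line)
    return earliest
```

Because it keeps the minimum timestamp rather than the first match, it still works when logs from several pods are concatenated out of order.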

I eventually traced it to a critical environment variable that wasn’t being set correctly because of a misconfiguration in our deployment pipeline. Fixing it meant digging into Helm charts and re-rolling several deployments, but once it was resolved, everything fell back into place. It’s moments like these that remind me how much robust logging and monitoring matter.
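
One cheap guard against this class of bug is to have the service refuse to start when a required variable is missing or empty, so the failure shows up at deploy time instead of hours into an outage. A minimal sketch follows; the variable names in `REQUIRED_VARS` are hypothetical, not the ones from our incident.

```python
import os

# Hypothetical variable names, for illustration only.
REQUIRED_VARS = ("DATABASE_URL", "SERVICE_MODE")


def missing_env(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def check_env(required, environ=None):
    """Raise RuntimeError listing every missing variable at once."""
    missing = missing_env(required, environ)
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )


if __name__ == "__main__":
    # Fails fast with a readable message if the pipeline dropped a value.
    check_env(REQUIRED_VARS)
```

Treating an empty string as missing is a deliberate choice here: a templating mistake often renders a variable as empty rather than leaving it out.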

The DevOps vs SRE Debate

I also spent a good chunk of time arguing with my team about whether we should formalize our roles further or keep them more fluid. On one hand, SRE as a separate discipline made sense: teams could specialize in reliability and automation without having to worry too much about application development. On the other hand, I believed the DevOps model still had its advantages, since it encouraged cross-functional teams to work together.

We eventually decided to take a hybrid approach, with some dedicated SRE roles but also keeping everyone else well-versed in reliability practices. The goal was to ensure that every developer understood the importance of writing resilient code and setting up good monitoring and logging from day one.

Remote Work Realities

The shift to remote work had its challenges too. We were constantly fighting video conferencing tools like Zoom just to collaborate and communicate well. The lack of physical presence sometimes made it harder to get everyone aligned on critical decisions. It wasn’t easy, but regular virtual standups and code reviews helped us stay connected.

The Tech Landscape

On the tech front, eBPF was gaining a lot of traction. I started experimenting with it in one of our projects and was fascinated by how much visibility into and control over kernel behavior you could get without writing traditional kernel modules. It was not a technology for the faint of heart, though; we went through several iterations before getting everything stable and secure.

Argo CD was also maturing quickly, giving us more robust tooling for managing deployments to our Kubernetes clusters. We started using it in one of our critical projects, and while there were some growing pains, the benefits of a GitOps-driven approach became clear over time.
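
The core GitOps idea is easy to state: Git holds the desired state, the cluster holds the live state, and a controller keeps working out what has to change to make the second match the first. The toy sketch below illustrates only that comparison step. It is not Argo CD's actual algorithm or API, and `plan_sync` and the resource names are invented for the example.

```python
def plan_sync(desired, live):
    """Return the actions needed to make `live` match `desired`.

    Both arguments map a resource name to its spec (any comparable value).
    `desired` stands in for what is committed to Git; `live` stands in
    for what is currently running in the cluster.
    """
    actions = []
    for name in sorted(desired):
        if name not in live:
            actions.append(("create", name))
        elif desired[name] != live[name]:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            # Running but no longer declared in Git.
            actions.append(("prune", name))
    return actions


if __name__ == "__main__":
    desired = {"api": {"image": "api:2"}, "web": {"image": "web:1"}}
    live = {"api": {"image": "api:1"}, "old-job": {"image": "job:1"}}
    for action, name in plan_sync(desired, live):
        print(action, name)
```

What won us over was the side effect of this model: every change to production became a commit we could review and revert.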

A Personal Note

Looking back at that November, I can see how much has changed. Back then, we were still figuring out the right balance between DevOps and SRE practices. Today, it feels like the landscape is more settled, but there’s always something new to learn. Whether it’s dealing with unexpected bugs or navigating complex toolchains, every day in engineering is a journey.

For now, I’m content knowing that we’re making progress, even when that means tackling challenges head-on. And who knows what the next big thing will be? Maybe it’ll be the eBPF magic, or maybe it’ll just be another day digging through logs and deploying updates. Either way, it’s an exciting time to be in tech.

That’s where I was back in November 2019, trying to navigate the ever-changing world of platform engineering and beyond.