$ cat post/a-segfault-at-three-/-the-endpoint-broke-on-staging-/-i-kept-the-old-box.md

02AUG21

a segfault at three / the endpoint broke on staging / I kept the old box

Title: August 2, 2021 - Ephemeral Containers and the SRE Conundrum

This past month has been a whirlwind. I’ve been knee-deep in some of the most critical pieces of our infrastructure work: containers, SRE practices, and developer portals. The tech world is full of buzzwords like “SRE,” “Backstage,” and “eBPF,” but it’s the real ops work that keeps me going.

Debugging a Dilemma: Ephemeral Containers

One of our services, which handles user authentication, had been running smoothly for years. We’ve always relied on persistent containers: long-lived containers, built from big, bloated Docker images, that stick around even when there’s no active workload. However, we recently started experimenting with ephemeral containers in our CI/CD pipelines. (By “ephemeral” I mean short-lived, throwaway containers, not the Kubernetes debugging feature of the same name.)

Ephemeral containers are a game-changer. They start fresh, load the latest code and dependencies, run the tests, and then shut down. That shortens the time between a code change and its deployment, because nobody has to build and deploy container images by hand.

However, this simplicity came with its own set of challenges. We started seeing issues where stateful information was lost mid-execution because ephemeral containers didn’t have the necessary persistence layer. Debugging these issues felt like chasing ghosts—restarting the container could sometimes make a problem disappear or reappear.
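To make the lost-state problem concrete, here’s a minimal sketch of one way around it: write progress to a checkpoint file that lives outside the container’s own filesystem, so a fresh container can pick up where the last one stopped. This is an illustration, not our pipeline code. The function names, the JSON layout, and the idea that the checkpoint path sits on a mounted volume are all assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return previously saved state, or an empty state on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed_steps": []}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state via a temp file and rename, so a kill mid-write
    never leaves a half-written checkpoint behind."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent))
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_name, path)


def run_steps(steps: list[str], checkpoint: Path) -> list[str]:
    """Run only the steps not yet recorded; return the ones executed now."""
    state = load_checkpoint(checkpoint)
    executed = []
    for step in steps:
        if step in state["completed_steps"]:
            continue  # a previous container already finished this step
        # Hypothetical placeholder: the real work for `step` would go here.
        state["completed_steps"].append(step)
        save_checkpoint(checkpoint, state)
        executed.append(step)
    return executed
```

If the checkpoint path points at the container’s own scratch space, every restart begins from an empty state, which is the ghost-chasing behavior described above. Pointing it at storage that outlives the container is what makes the second run skip the finished steps.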

We spent hours arguing about whether we should stick to persistent containers for stability, or switch completely to ephemeral ones to save time and resources. The pros of ephemeral containers are clear: faster iterations, less manual intervention, and no stale state lingering around. But the cons aren’t as straightforward—state management can be tricky, especially with complex services.

The SRE Conundrum

Speaking of stability, our SRE team has been on high alert. As we scaled up our remote infrastructure to accommodate more people working from home due to the ongoing pandemic, we faced a new set of challenges. Kubernetes clusters were starting to show signs of complexity fatigue—too many services, too much manual intervention in deployments, and not enough automation.

We decided to dive headfirst into GitOps with Argo CD and Flux. The idea was simple: keep the desired state of our applications in Git and let the tooling sync it to production automatically. In practice, though, implementing these tools correctly requires a deep understanding of both your infrastructure and your deployment processes.
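The core idea behind that sync is a comparison between what Git says should be running and what is actually running. Here’s a toy sketch of that comparison. It is not how Argo CD or Flux are implemented and it uses none of their APIs. The function name `plan_sync` and the name-to-version dictionaries are made up for illustration.

```python
def plan_sync(desired: dict[str, str], live: dict[str, str]) -> list[tuple[str, str]]:
    """Compare desired state (from Git) with live state (from the cluster)
    and return the actions needed to make live match desired."""
    actions = []
    for name, version in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))   # declared in Git, missing live
        elif live[name] != version:
            actions.append(("update", name))   # running, but at the wrong version
    for name in sorted(live):
        if name not in desired:
            actions.append(("prune", name))    # running, but no longer in Git
    return actions


# Hypothetical example: Git declares two services, the cluster runs two others.
desired_state = {"auth": "v2", "api": "v1"}
live_state = {"auth": "v1", "legacy-cron": "v1"}
print(plan_sync(desired_state, live_state))
# [('create', 'api'), ('update', 'auth'), ('prune', 'legacy-cron')]
```

Even in a toy like this, the prune step shows why you need to know your infrastructure first: anything that was deployed by hand and never committed to Git looks like drift to be removed.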

There were moments when the GitOps strategy clashed with our existing SRE practices. We had to rethink how we handle deployments—should every commit be immediately pushed live? Or should we have a staging environment where changes are tested before being rolled out?

The SRE team was initially resistant, worried that these new tools would introduce more variables and potential points of failure. But as we started seeing the benefits in terms of reduced manual effort and improved stability, they began to come around. The key was striking a balance between automation and human oversight.

Learning and Evolving

As I reflect on this month, it feels like I’ve been navigating through a maze where every turn brings new challenges. From the complexity of ephemeral containers to the evolving practices in SRE, each decision has been tough but necessary.

One lesson that stands out is the importance of being adaptable. The tech landscape changes so rapidly that what worked yesterday might not work today. Embracing change means constantly learning and iterating—whether it’s adopting new tools or simply reevaluating existing ones.

And sometimes, it just comes down to picking your battles wisely. We’ve decided to stick with persistent containers for now, knowing full well we’ll revisit the topic soon. Until then, our focus is on keeping everything running smoothly so that we can make a more informed decision later.

That’s where I am today. The road ahead looks promising but also challenging. There’s still so much to learn and do. Let’s keep pushing forward!