$ cat post/a-shell-i-once-loved-/-the-socket-never-closed-right-/-it-boots-from-the-past.md

a shell I once loved / the socket never closed right / it boots from the past


Title: Chaos Engineering vs. Continuous Integration: An Old Argument in a New Light


On June 7, 2010, I was deep into my daily grind as an ops engineer, trying to balance the demands of production uptime against the relentless push for new features. The term “DevOps” was still gaining traction, and it felt like every day the team split into the same two camps: those who believed traditional continuous integration (CI) was enough, and those pushing chaos engineering as the only way to prove our systems could survive failure.

Back then, I remember being part of a group that had just started experimenting with Chef for configuration management. We were on the bleeding edge, trying to automate everything from deployment scripts to environment setup, but there was still this nagging doubt in my mind about whether we were really doing enough to test our systems under stress.

Around that same time, word was beginning to spread about Netflix’s approach to resilience: deliberately injecting failures into production to prove the system could absorb them (tooling they would later describe publicly as Chaos Monkey). I remember reading about it with a mix of awe and skepticism. How could you possibly inject failures into your production system and expect everything to just work? But as I dug deeper into the idea, I couldn’t ignore the logic behind it. We had been hit by plenty of issues that should have been caught in our testing environment but weren’t. Maybe, just maybe, injecting simulated faults would help us find the weak spots before we launched a new feature.

Around this time, I was also wrestling with how to integrate Chef into our CI pipeline. We were using Jenkins for our build processes and wanted to automate the deployment of changes as well. The idea was that developers could make changes in their local environments, run tests on them, commit those changes, and have Jenkins automatically deploy everything to a staging environment. From there, if all tests passed, we would promote it to production.
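The commit-test-stage-promote flow described above can be sketched as a minimal promotion gate. This is a hypothetical reconstruction, not our actual Jenkins configuration; the stage names and commands are stand-ins:

```python
# Minimal sketch of the commit -> test -> stage -> promote flow.
# Stage names and commands are illustrative placeholders, not real jobs.

import subprocess

STAGES = [
    ("unit tests",         ["true"]),  # stand-in for the real test command
    ("deploy to staging",  ["true"]),  # stand-in for the Chef-driven deploy
    ("staging smoke test", ["true"]),
]

def run_pipeline(stages):
    """Run each stage in order; stop at the first failure so a broken
    build never reaches the promote step."""
    for name, cmd in stages:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"stage failed: {name} -- not promoting")
            return False
    print("all stages green -- promoting to production")
    return True

if __name__ == "__main__":
    run_pipeline(STAGES)
```

The point of the sketch is the gate, not the commands: promotion to production only happens if every earlier stage exits cleanly.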

But this posed an interesting challenge: how do you test the system thoroughly enough before deploying? The answer seemed obvious—integrate Chef directly into our CI pipeline so that every change went through automated configuration testing. If any part of the infrastructure broke due to a change in the codebase, we could catch it early and avoid having a cascading failure during a production deploy.
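The “catch infrastructure breakage early” idea boils down to validating rendered configuration before a deploy is allowed to proceed. A minimal sketch, assuming JSON-rendered node attributes and a hypothetical list of keys the application requires:

```python
# Sketch of pre-deploy config validation: check rendered node attributes
# against the keys the application expects, so config errors fail the
# build instead of surfacing mid-deploy. The required-key list is
# illustrative.

import json

REQUIRED_KEYS = {"db_host", "db_port", "app_port"}  # hypothetical

def validate_attributes(raw_json):
    """Return a list of problems; an empty list means the config is safe
    to ship."""
    try:
        attrs = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"attributes are not valid JSON: {exc}"]
    problems = []
    missing = REQUIRED_KEYS - attrs.keys()
    problems.extend(f"missing required key: {k}" for k in sorted(missing))
    if "db_port" in attrs and not isinstance(attrs["db_port"], int):
        problems.append("db_port must be an integer")
    return problems
```

Wired into a CI job, a non-empty result fails the build, which is exactly the “catch it early” behavior the pipeline was meant to guarantee.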

The debate within the team was intense. On one side were those who argued that continuous integration should be enough. They believed that by running comprehensive test suites, we could catch any issues before deployment. On the other side, some of us felt that we needed to go further and simulate real-world conditions—what if a network connection failed? What would happen if a database went down?

I found myself in the middle, trying to find common ground. We started by integrating Chef into our CI pipeline and setting up automated checks for each deployment. But even with this, there were still instances where things didn’t go as planned. For example, we had a situation where an application server failed during a deployment, causing a cascading failure in the database. This highlighted the need to simulate real-world conditions more aggressively.

So, I decided to set up a small chaos engineering lab within our infrastructure. We started by injecting simulated failures into our staging environments and monitoring how everything behaved. It was painful at first—seeing systems go down because of something that “shouldn’t” have happened. But it also taught us invaluable lessons about how robust our architecture truly was.
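The core of that staging experiment can be sketched in a few lines: wrap a dependency call so a configurable fraction of calls fail, then observe whether the caller degrades gracefully or cascades. The class names and retry policy here are illustrative, not our actual tooling:

```python
# Sketch of staging fault injection: a dependency that fails some
# configurable fraction of the time, and a caller whose behaviour under
# those faults is what the experiment measures.

import random

class FlakyDependency:
    """Simulates a dependency (e.g. a database) that fails at a given rate."""
    def __init__(self, failure_rate, rng=None):
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def call(self):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "ok"

def resilient_fetch(dep, retries=3):
    """The behaviour under test: retry a few times, then fall back to a
    degraded answer instead of letting the failure cascade upward."""
    for _ in range(retries):
        try:
            return dep.call()
        except ConnectionError:
            continue
    return "degraded"
```

With the failure rate at zero the caller behaves normally; with it at one, the experiment checks that the system answers “degraded” rather than falling over, which is the lesson the staging lab kept teaching us.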

By the end of June, we had managed to automate much of our CI process and set up a basic framework for chaos engineering. The argument between continuous integration and chaos engineering wasn’t as black-and-white as I thought. It turned out that both were necessary but served different purposes. Continuous integration helped us catch issues early in development, while chaos engineering ensured our systems could handle real-world failures gracefully.

Looking back on that period, it’s clear to me now that the tools and technologies of 2010 laid the groundwork for what we consider modern DevOps practices today. The emergence of platforms like Heroku, NoSQL databases, and cloud services like AWS were transforming how applications were built and deployed. But at the heart of it all was a simple question: how do we ensure our systems can handle failure?

That’s still a question I grapple with every day, but now, thanks to chaos engineering, I know there are ways to approach it that go beyond just CI alone. June 7, 2010, marked the beginning of my journey in understanding this crucial aspect of building resilient systems.