Debugging the AWS Reboot on September 3, 2012
September 3rd, 2012 was a memorable day for me and my team. We were working hard to ship our new feature, but unbeknownst to us, there was a massive issue brewing in the cloud that would take down parts of our infrastructure.
It all started as we were testing out the latest version of Chef on our servers. We had been using Puppet for years, and moving over to Chef was part of our DevOps transformation. But something just wasn’t right. Our configuration changes seemed to be causing issues instead of fixing them. I spent most of the day chasing a mysterious error that kept popping up, and every time I thought I had pinned it down, it resurfaced somewhere else.
Meanwhile, on another machine, our application logs were showing intermittent connection failures. It looked like some network packets weren’t making it to their destinations. I pinged around the office asking if anyone was seeing similar issues and got a few nods. We clearly weren’t alone in this.
That’s when our ops team notified us about an AWS issue that had been causing downtime for users across various services. The “AWS Reboot,” as it became known, was causing connectivity issues between Amazon VPCs and instances. We were among the affected users.
As I dug deeper, I found that our Chef recipes weren’t handling network configuration changes properly. A change in AWS’s routing or DNS setup had caused our application to fail, and what we had dismissed as a simple config issue turned out to be our recipes overwriting critical network settings, leaving our services unreachable.
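To make that concrete, here is a minimal sketch of the kind of recipe that bit us. The cookbook attribute and template name are made up for illustration; the point is a template resource that rewrites /etc/resolv.conf on every converge from a hard-coded attribute, silently clobbering whatever the VPC had actually handed the instance.

```ruby
# Illustrative only: a recipe that unconditionally rewrites resolv.conf from a
# node attribute. When AWS changed DNS/routing underneath us, this kept
# stomping the values the instance actually needed.
template '/etc/resolv.conf' do
  source 'resolv.conf.erb'
  owner  'root'
  group  'root'
  mode   '0644'
  # 'our_app' and its nameserver list are hypothetical names for this sketch.
  variables(nameservers: node['our_app']['nameservers'])
end
```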
We quickly switched to the AWS console to check our security groups and route tables by hand. That made the extent of the problem clear: our instances couldn’t communicate with each other properly because of misconfigured routes.
To fix it, I rolled back the recent Chef changes, which wasn’t ideal but was necessary to stabilize things. We then wrote better validation checks into our recipes to prevent this from happening again. It was a painful process, but we emerged stronger for the lessons.
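The checks we added were along these lines: a minimal sketch, reusing the same hypothetical attribute as above, that fails the Chef run before the template is rendered if the nameservers we are about to write do not actually answer queries.

```ruby
# Illustrative only: fail fast if the nameservers we intend to write are dead,
# so a bad converge can no longer leave the box unreachable.
ruby_block 'validate_nameservers' do
  block do
    require 'resolv'
    node['our_app']['nameservers'].each do |ns|
      # getaddress raises Resolv::ResolvError on failure, aborting the run
      # before we ever touch /etc/resolv.conf.
      Resolv::DNS.new(nameserver: [ns]).getaddress('amazonaws.com')
    end
  end
end

template '/etc/resolv.conf' do
  source 'resolv.conf.erb'
  variables(nameservers: node['our_app']['nameservers'])
end
```

A guard like this would have turned the mystery into an obvious converge failure pointing straight at DNS, instead of a quietly broken box.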
Looking at what was going on outside of work that month, it was interesting to see the DevOps movement gaining traction. We were living some of its principles every day, writing better automation and configuration management. And just as I was wrestling with Chef, Netflix’s chaos engineering was becoming more prominent in the industry.
In the end, while we spent a lot of time debugging this issue, it was a valuable lesson in thorough validation and testing of infrastructure changes. It wasn’t glamorous, but it was an important part of our journey to better DevOps practices.
As for the Hacker News stories that month, they all seemed far removed from what we were dealing with. The discussions around learnable programming and user interfaces felt like a different world from debugging routing tables on AWS. But in the tech industry, you never know when something seemingly unrelated might come back to bite you.
Anyway, there’s always another day to debug or argue about best practices. I guess that’s what makes this job so exciting—there’s no shortage of challenges and lessons to learn.