$ cat post/strace-on-the-wire-/-the-database-was-the-truth-/-the-cron-still-fires.md

strace on the wire / the database was the truth / the cron still fires


Title: Debugging a Critical Bug in Our New Hadoop Cluster


May 28, 2007 was just another day for me and the tech team at my startup. We were building infrastructure on Amazon’s EC2 and S3 to serve our growing user base, and the cloud vs. colo debate was still raging as we weighed the cost and reliability of the cloud against a traditional data center.

That morning, I got a call from one of our ops guys, who was panicking. Our new Hadoop cluster, deployed only the week before, was acting up: nodes were dropping out unexpectedly, putting critical data at risk. This wasn’t good. Hadoop is supposed to be fault-tolerant, but we were seeing failures that shouldn’t have happened.

I quickly logged into our EC2 instances and started running diagnostics. The first thing I did was check the Hadoop logs for error messages. What I found only made me more confused: the messages weren’t very descriptive. It was like trying to solve a mystery without enough clues.
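
If you’ve never had to do this, the first pass looked roughly like the sketch below. The log directory and filename patterns are assumptions based on a stock Hadoop install, not our exact layout.

```python
import glob
import re

# Assumed install location and default daemon log naming; adjust for your layout.
LOG_GLOB = "/usr/local/hadoop/logs/hadoop-*-*.log"
pattern = re.compile(r"\b(ERROR|FATAL|Exception)\b")

# Print every suspicious line with its file and line number so failures on
# different nodes can be lined up by timestamp afterwards.
for path in glob.glob(LOG_GLOB):
    with open(path, errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            if pattern.search(line):
                print(f"{path}:{lineno}: {line.rstrip()}")
```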

After a few hours of digging, I decided to reach out to some of my network contacts who had experience with Hadoop. One of them suggested that the issue might be related to the way we were launching our instances from Amazon’s Elastic Block Store (EBS). Some people had reported problems with EBS performance and stability in production environments.

I knew this was worth investigating further, so I set up a test cluster on EBS volumes to replicate what we ran in production. Sure enough, it behaved the same way, dropping nodes unexpectedly. And because the test cluster was idle and freshly configured, that ruled out the usual suspects like overloaded nodes or misconfigured Hadoop settings.

At this point, I felt a mix of frustration and determination. We were depending heavily on the Hadoop cluster to process large datasets, and losing data there could mean serious downtime for us. So I decided to take a more methodical approach and capture network traffic with tcpdump to see whether something was going wrong at the packet level.
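
The capture itself was nothing fancy. Here is a minimal sketch of the kind of thing I was running, wrapped in Python and narrowed to DNS traffic since, in hindsight, that was the part that mattered; the interface, the filter, and the need for root privileges are assumptions about a typical Linux box rather than our exact invocation.

```python
import subprocess

# -i any: all interfaces (Linux), -n: skip reverse lookups (don't depend on the
# thing being debugged), -tt: raw timestamps, -l: line-buffered output.
proc = subprocess.Popen(
    ["tcpdump", "-i", "any", "-n", "-tt", "-l", "udp port 53"],
    stdout=subprocess.PIPE,
    text=True,
)

try:
    for line in proc.stdout:
        # Each line is one DNS query or response; stalls and odd answers stand
        # out once you read these alongside the Hadoop logs.
        print(line.rstrip())
except KeyboardInterrupt:
    proc.terminate()
```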

After hours of analyzing packets and cross-referencing them with the Hadoop logs, I finally found the culprit: our DNS setup. It turned out we were using a custom DNS resolver for internal services, and it wasn’t playing well with the EBS-backed instances. The resolver would sometimes return incorrect IP addresses or take too long to resolve names, and the affected nodes would drop out of the cluster and restart.
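
If you want to see this class of problem without wading through packet dumps, here is a small sketch of the kind of sanity check that surfaces it: resolve each worker’s hostname repeatedly and flag slow or inconsistent answers. The hostnames, attempt count, and “too slow” threshold are placeholders, not our real values.

```python
import socket
import time

NODES = ["hadoop-worker-01.internal", "hadoop-worker-02.internal"]  # placeholders
ATTEMPTS = 20
SLOW_SECONDS = 1.0  # arbitrary threshold for "took too long to resolve"

for host in NODES:
    answers = set()
    for _ in range(ATTEMPTS):
        start = time.time()
        try:
            addr = socket.gethostbyname(host)
        except socket.gaierror as exc:
            print(f"{host}: lookup failed ({exc})")
            continue
        elapsed = time.time() - start
        answers.add(addr)
        if elapsed > SLOW_SECONDS:
            print(f"{host}: slow lookup ({elapsed:.2f}s) -> {addr}")
    if len(answers) > 1:
        print(f"{host}: inconsistent answers: {sorted(answers)}")
```

Run it from one of the nodes themselves; a mix of slow lookups and more than one address for the same name is the kind of pattern that can make a healthy node look dead to the rest of the cluster.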

This discovery was both infuriating and enlightening. It made me realize how important it is to test every aspect of your infrastructure thoroughly—especially when you’re depending on third-party services like AWS. I also learned the hard way that sometimes the most critical bugs can come from seemingly unrelated components in your stack.

In the end, we fixed the DNS issue and everything stabilized. We had a bit of downtime during the transition, but no data was lost. This experience taught me to be more vigilant about potential hidden issues when deploying complex distributed systems like Hadoop on cloud platforms.

Debugging this issue solidified my belief in both the power and the fragility of infrastructure at scale. It also highlighted how quickly the landscape was changing; within a year, GitHub would launch and change so much about how we work together as developers. But for now, I was just happy to have solved our critical issue.


That’s a look back at one of those days in 2007 when bugs could still surprise you and testing was everything. Tech moves fast, but sometimes it’s the old-fashioned debugging that reveals the most about your infrastructure and team.