$ cat post/net-split-in-the-night-/-a-certificate-expired-there-/-the-port-is-still-open.md

net split in the night / a certificate expired there / the port is still open


Debugging Digg’s Big Day


December 20, 2004. It’s hard to believe more than a decade has passed since that day, but it has. I was in the middle of something rather urgent at my job back then, working on a platform that supported several online communities and news aggregators. One of those was Digg, which had launched only a couple of weeks earlier and was already getting far more attention than usual. Having watched its traffic climb day by day since launch, I knew a day like this was coming.

The Setup

Digg’s popularity had grown at a startling clip in the short time since launch, thanks in large part to its core mechanic of letting users submit stories and vote (“digg”) them toward the front page. That drove sharp traffic spikes at predictable times of day: the morning rush, as people checked their feeds before starting work, and lunchtime, when folks had a moment to catch up on news.

Our platform was built on the LAMP stack (Linux, Apache, MySQL, and PHP), with a healthy dose of Perl for our internal tools. We used Xen virtualization to run multiple instances of Digg’s environment, which let us scale out quickly during peak times without breaking the bank. The infrastructure had been holding up well, but as Digg gained users, it was pushing the limits of what our setup could handle.

The Incident

One morning, just before 10 AM, our monitoring system flagged something concerning: CPU usage on one of the Xen VMs supporting Digg had spiked dramatically. I logged into the server to see what was going on. My first thought was a scheduled update or patch, but the timing was wrong and more than one machine was affected, so something else had to be at play.

After running some diagnostics, I found that one of our custom PHP scripts for handling user votes was going haywire. It turned out there was a race condition in the code: multiple requests were processing votes against the same stories simultaneously, and the resulting contention sent load and database query volume through the roof. Apache’s workers pinned the CPU, leaving little room for anything else on the box to run.
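To make the failure mode concrete, here is a minimal sketch of the kind of read-modify-write pattern that bites here. Everything in it (the table, column, and credential names) is invented for illustration; the real script was more involved, but the race is the same.

```php
<?php
// Hypothetical sketch of the unsafe vote handler; all names invented.
mysql_connect('localhost', 'digg_app', 'secret');
mysql_select_db('digg');

$storyId = (int) $_GET['story_id'];

// Step 1: read the current count.
$row = mysql_fetch_assoc(mysql_query(
    "SELECT vote_count FROM stories WHERE story_id = $storyId"));

// Step 2: write back count + 1. Between steps 1 and 2, another request
// can read the same starting value, so concurrent votes silently
// overwrite each other, and every handler hammers the same hot row.
mysql_query("UPDATE stories SET vote_count = " . ($row['vote_count'] + 1) .
            " WHERE story_id = $storyId");
```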

The Fix

With the issue identified, we had to act fast. First, I reverted the recent changes that might have introduced the bug. Then I dug into how we could fix the race condition without rewriting the script wholesale. The solution came down to adding locks around the voting logic and tracking vote tallies in a form that could be updated consistently.
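A sketch of what that fix looked like, assuming MySQL’s GET_LOCK()/RELEASE_LOCK() named locks, which the MySQL of that era already supported; the table and lock names are again illustrative:

```php
<?php
// Sketch of the locked vote path; all names invented for illustration.
mysql_connect('localhost', 'digg_app', 'secret');
mysql_select_db('digg');

$storyId = (int) $_GET['story_id'];
$lockName = "story_vote_$storyId";

// GET_LOCK returns 1 if the lock was acquired within the 5s timeout.
$got = mysql_fetch_row(mysql_query("SELECT GET_LOCK('$lockName', 5)"));
if ($got[0] == 1) {
    // The increment is now a single atomic statement, so interleaved
    // requests can no longer lose votes.
    mysql_query("UPDATE stories SET vote_count = vote_count + 1
                 WHERE story_id = $storyId");
    mysql_query("SELECT RELEASE_LOCK('$lockName')");
} else {
    // Timed out waiting; bail out instead of piling on more queries.
    error_log("vote lock timeout for story $storyId");
}
```

Strictly speaking, the single-statement increment is atomic on its own; the named lock earns its keep once the voting logic spans several queries, such as duplicate-vote checks or updates to per-user tables.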

To make things easier going forward, I wrote a small utility that generated the necessary lock statements automatically, based on function names and patterns in the codebase. That way, future developers wouldn’t have to remember this class of error on their own.
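The real tool is long gone, so this is only a rough sketch of the idea: scan a source file for functions matching a naming convention (the vote_* pattern here is an assumption, not the original convention) and emit a lock-wrapped version of each.

```php
<?php
// Illustrative generator: wrap every vote_* function in a named lock.
// The naming convention and wrapper shape are assumptions, not the
// original tool's actual output. Usage: php genlocks.php votes.php
$source = file_get_contents($argv[1]);
preg_match_all('/function\s+(vote_\w+)\s*\(/', $source, $matches);

foreach ($matches[1] as $fn) {
    echo "function {$fn}_locked(\$storyId) {\n";
    echo "    mysql_query(\"SELECT GET_LOCK('lock_{$fn}_\$storyId', 5)\");\n";
    echo "    \$result = $fn(\$storyId);\n";
    echo "    mysql_query(\"SELECT RELEASE_LOCK('lock_{$fn}_\$storyId')\");\n";
    echo "    return \$result;\n";
    echo "}\n\n";
}
```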

The Aftermath

After deploying these changes, everything stabilized within an hour. We monitored closely for any signs of recurrence and found none. The incident highlighted a few things:

  1. The Value of Monitoring: Having robust monitoring in place allowed us to catch the problem early.
  2. Race Conditions Are Real: They can happen even when you think your code is rock solid, so always consider them.
  3. Automating Repetitive Tasks: Writing tools that automate repetitive tasks can save a lot of time and reduce human error.

Overall, it was a good learning experience for everyone involved. The tech landscape was changing rapidly, but being prepared to adapt and fix issues quickly was key. As the holiday season began in earnest, I couldn’t help but think about how much had changed in just a few years—and how much more would change over the next decade.


That’s it from this day in 2004. Debugging, learning, growing—those are the ongoing themes that keep things interesting.