Debugging Digg: The Day the Link-Posting Beast Crashed My Server
June 27, 2005. A typical day at the office, except this one turned out to be very different. Our little web application had just launched into the world of Web 2.0 and the open-source revolution. I was an early platform engineer at a startup that was slowly but surely becoming a household name: Digg.
The Setup
Back then, our stack was pretty standard for the era: LAMP (Linux, Apache, MySQL, PHP). We had a handful of servers running on Xen virtualization, with Perl scripts handling some of the automation. The core app itself was written in PHP with a sprinkle of Python. We were still figuring out how to scale, but we knew we needed to be more agile.
The Incident
It started innocently enough. I was sitting at my desk, sipping on an espresso and trying to plan for the next release. Suddenly, my server monitoring tool (we used Nagios) pinged me with a critical alert. My heart raced as I clicked over to see what was going on.
Nagios Alert: CPU Usage 100%
I quickly jumped into my shell session:
top -b -n 1 | grep -i "diggs"
And there it was: php5 eating up all the CPU. I looked at the process list and saw a ton of identical entries, PHP scripts processing user submissions.
The Analysis
I had written some simple Perl scripts to handle background jobs, but this didn’t look like any of those. These were regular PHP scripts running under Apache’s mod_php. How did so many get triggered? I checked our database logs and found that a new feature we had launched a few weeks earlier was kicking off processing for every user submission on Digg.
This was the first time I had to deal with an avalanche of requests: thousands, maybe even tens of thousands, per minute. We hadn’t anticipated this level of traffic. It was like trying to catch snowflakes in your hands, only it was all code and servers.
The Fix
After a quick brainstorming session with the team (we had a small but growing dev ops crew), we came up with a plan:
- Rate Limiting: We needed some way to throttle incoming requests without completely breaking the app.
- Queueing System: Integrate a message queue like Gearman or RabbitMQ so that our background jobs could be processed in batches instead of all at once (a rough sketch of the idea follows this list).
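I no longer have the original code, but the shape of the queueing idea is easy to sketch with Gearman’s PHP extension (pecl/gearman, which arrived a bit after this incident): the web request does nothing but enqueue the submission, and a long-running worker drains the queue at a pace the database can tolerate. The server address and the process_submission task name below are made up for illustration.

```php
<?php
// submit.php (web side): enqueue the submission instead of doing the
// work inline under mod_php. Assumes the pecl/gearman extension.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);   // gearmand host/port (example)
$client->doBackground('process_submission', json_encode(array(
    'user_id' => $userId,
    'url'     => $submittedUrl,
)));

// worker.php (CLI side): a long-lived process that pulls jobs off the
// queue one at a time instead of letting every submission hit MySQL at once.
function process_submission(GearmanJob $job)
{
    $payload = json_decode($job->workload(), true);
    // ...validate the submission, write to MySQL, update counters...
}

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('process_submission', 'process_submission');

while ($worker->work()) {
    // blocks until the next job arrives; one submission per iteration
}
```

The nice property is that Apache processes return immediately, and the rate at which submissions hit MySQL is bounded by how many workers you choose to run.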
We implemented rate limiting first, using Apache’s mod_evasive. This helped, but we needed something more robust for the long term. We also added logging to track which submissions were being hit hardest and why.
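For anyone who hasn’t used it, mod_evasive keeps a per-IP hash of recent hits and starts answering with 403s once a client crosses a per-page or per-site threshold within a short interval. I don’t remember our exact values, so the numbers in this sketch are illustrative rather than what we actually shipped:

```apache
<IfModule mod_evasive20.c>
    # size of the hash table tracking per-IP hit counts
    DOSHashTableSize    3097
    # max hits on the same URI per DOSPageInterval (seconds)
    DOSPageCount        5
    DOSPageInterval     1
    # max hits on the whole site per DOSSiteInterval (seconds)
    DOSSiteCount        100
    DOSSiteInterval     1
    # how long (seconds) an offending IP keeps getting 403s
    DOSBlockingPeriod   30
</IfModule>
```

The trade-off is classic rate limiting: set the thresholds too aggressively and you lock out legitimate users behind shared proxies; too leniently and it does nothing.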
The Aftermath
The incident highlighted a few key lessons:
- Scalability Planning: We needed better planning for traffic spikes.
- Infrastructure Resilience: Our servers had to handle much more than we anticipated.
- Team Communication: Effective communication is crucial when dealing with major incidents.
In the weeks that followed, we worked on refactoring our codebase and improving our deployment processes. The lessons learned from this incident were invaluable as Digg continued to grow.
Looking Back
That day taught me a lot about the reality of running production services in an open-source world. Web 2.0 was still young, but the pace of change was relentless. Debugging Digg on that fateful June day wasn’t just about fixing servers; it was about learning how to build and maintain resilient systems.
As I write this, I can’t help but chuckle at how much has changed since then. But even now, as a seasoned platform engineer, I still look back on those days with a mix of nostalgia and gratitude for the challenges they presented.