$ cat post/debugging-digg:-a-day-in-the-life-of-a-sysadmin.md

Debugging Digg: A Day in the Life of a Sysadmin


May 30th, 2005 was a Monday and it started out like any other. I woke up early, made my usual bowl of oatmeal (I might have added a little sugar), and logged into IRC to catch up on the previous day’s happenings before heading over to the office.

Today had a few things lined up: help a developer with some Python scripting they were struggling with, then spend the afternoon debugging Digg. Yes, those guys were having issues again.

Breakfast: Oatmeal and Sugar (with a side of Sysadmin Stress)

The oatmeal was surprisingly good this morning. As I sipped my coffee, I noticed the Digg IRC channel was quite active. The server had started to slow down significantly, which wasn’t uncommon as their user base continued its rapid growth.

Helping Out with Python

I joined the developer’s channel; they were working on a script to scrape data from an external API and kept running into timeouts and trouble parsing the JSON responses. I spent about 30 minutes with them, tweaking their code, showing them how to use urllib more efficiently, and making sure they had error handling in place.
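In 2005 this would have been Python 2 and urllib/urllib2; the same idea in modern Python looks something like this (a minimal sketch, with a placeholder URL rather than their actual endpoint):

    import json
    import urllib.error
    import urllib.request

    API_URL = "https://api.example.com/items"  # placeholder, not the real endpoint

    def fetch_items(url=API_URL, timeout=10):
        """Fetch a JSON response with a timeout and basic error handling."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp)
        except (urllib.error.URLError, TimeoutError) as exc:
            # URLError covers DNS and HTTP failures; TimeoutError covers slow reads.
            print(f"request failed: {exc}")
            return None
        except json.JSONDecodeError as exc:
            print(f"could not parse response as JSON: {exc}")
            return None

The point wasn’t the exact code, it was the habit: always set a timeout, and never assume the response is well-formed.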

By the time I finished up there, it was around noon and I knew I needed to focus on Digg’s issue before the day got away from me.

The Digg Issue

As soon as I arrived at my desk, I grabbed the server logs. The application servers were under heavy load, and the database queries seemed to be taking longer than usual. I started by reviewing the recent changes in their codebase—nothing major but a few tweaks here and there that could have introduced some performance issues.

I did a round of top on all the web servers. CPU usage was around 75%, which wasn’t too bad, but memory was at 90% and processes were dipping into swap. Clearly something was wrong.
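Eyeballing top across a fleet gets old fast. A quick script over /proc/meminfo answers the same question from a shell loop (a sketch, Linux-only; the thresholds here are my own, not Digg’s):

    def meminfo():
        """Parse /proc/meminfo into a dict of kB values (Linux-only)."""
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":", 1)
                info[key] = int(value.split()[0])  # first token is the kB count
        return info

    m = meminfo()
    # MemFree ignores reclaimable page cache, but it's close enough for a rough check.
    mem_used_pct = 100 * (m["MemTotal"] - m["MemFree"]) / m["MemTotal"]
    swap_used_kb = m["SwapTotal"] - m["SwapFree"]
    print(f"memory used: {mem_used_pct:.0f}%, swap in use: {swap_used_kb} kB")
    if mem_used_pct > 90 or swap_used_kb > 0:
        print("WARNING: this box is under memory pressure")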

Exploring Further

I fired up htop and looked closely at the top-consuming processes. Their caching layer seemed to be the culprit: the memcached cache was seeing far fewer hits than usual, which meant more requests were going straight to the database instead of being served from cache. That was causing a lot of I/O wait and slower response times.
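The lookup path here is the standard cache-aside pattern: check the cache, and on a miss, hit the database and repopulate. A sketch using the python-memcached client (the key scheme and the stubbed database call are illustrative, not Digg’s code):

    import memcache  # python-memcached client

    mc = memcache.Client(["127.0.0.1:11211"])
    CACHE_TTL = 300  # seconds; an illustrative value, not their actual setting

    def query_database(story_id):
        # Stand-in for the real database read.
        return {"id": story_id, "title": "example story"}

    def get_story(story_id):
        """Cache-aside read: every miss here becomes a database query."""
        key = f"story:{story_id}"
        story = mc.get(key)
        if story is None:
            story = query_database(story_id)
            mc.set(key, story, time=CACHE_TTL)
        return story

When hit rates fall, that if-branch runs far more often, and the database feels it immediately.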

I made a quick fix by tweaking the cache settings in their config files: increasing some TTLs (time-to-live) and raising the number of memcached connections. But tweaks alone weren’t the answer; we needed to understand why performance had degraded in the first place. After talking it through with one of the developers, we realized a cron job doing background database updates had collided with Digg’s sudden surge in traffic and overwhelmed the setup.
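The durable fix for a job like that is to pace it so it can’t saturate the database when traffic spikes. A hedged sketch of the idea (batch size, pause length, and the update helper are all illustrative):

    import time

    BATCH_SIZE = 500     # rows per batch; tune to your write capacity
    PAUSE_SECONDS = 2.0  # breathing room between batches

    def run_background_updates(row_ids, apply_update):
        """Apply updates in small batches with pauses so the cron job
        can't monopolize the database during a traffic surge."""
        for i in range(0, len(row_ids), BATCH_SIZE):
            for row_id in row_ids[i:i + BATCH_SIZE]:
                apply_update(row_id)  # caller supplies the actual DB write
            time.sleep(PAUSE_SECONDS)

Scheduling it off-peak helps too, but pacing protects you even when “off-peak” stops existing.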

Lessons Learned

This experience taught me two key things:

  1. Documentation is Key: Having clear documentation on server configurations and expected behavior can save a lot of time.
  2. Understanding the Underlying System: Knowing how each piece of your system works together—how caching affects queries, for instance—is crucial in troubleshooting.

By the end of the day, Digg was running more smoothly again. The team had learned from this experience, and I felt good about helping out. It’s a reminder that even with fast-growing services like Digg, sysadmin skills remain vital to maintaining performance and reliability.


That evening, as I settled in to watch some TV (I was finally catching up on Lost, starting with the first episode, if you’re curious), I couldn’t help but think about how much had changed since my days at Netscape. Open-source tools like Xen were gaining traction, and Firefox was proving that open source could still compete on the desktop. But for now, it was just another day in the life of a sysadmin.