$ cat post/the-buffer-overflowed-/-i-ssh-to-ghosts-of-boxes-/-the-deploy-receipt.md
the buffer overflowed / I ssh to ghosts of boxes / the deploy receipt
Debugging the Digg: A Day in the Life of an Ops Guy
May 8, 2006 - The sun was just coming up as I walked into work. Today was going to be a big day for us. We had a major update rolling out, and everyone was on edge. You see, Digg is getting more popular every day, and we need to make sure our infrastructure can keep up with the demands of 150,000 users.
The Setup
We’ve been using Xen for our virtualization needs, running Apache and MySQL on a mix of custom scripts and a dash of Python. Our stack isn’t fancy; it’s simple, but reliable. We’ve got four servers in total: two for the web front end, one for MySQL, and another that serves as both a backup and a staging server. We’re not exactly bleeding-edge, but we get the job done.
The Problem
Late this afternoon, I noticed something was off in our monitoring. The graphs showed higher-than-normal CPU usage on the web servers, and the MySQL server was under more load than usual. Normally this would just be part of a typical Monday rush, but today felt different.
I decided to head over to the server room. As I walked in, the smell of old tech hit me—cables everywhere, faintly humming cooling fans, and the occasional whine of hard drives spinning up. The lights were dimmed, which is standard for late afternoons when not much else is happening around here.
Digging In
I pulled up my SSH client to check on things. The CPU usage graphs looked worrisome. I logged into one of the web servers and ran some basic commands:
top
The output showed that Apache was consuming most of the resources, with several httpd processes sitting at high CPU percentages.
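When top points at Apache, my next step is usually to sort the whole process table by CPU and count how many workers are actually busy. A minimal sketch of the kind of thing I run (the httpd process name is an assumption about the install):

```shell
# %CPU is column 3 of `ps aux`; a reverse numeric sort on it
# floats the worst offenders to the top.
ps aux | sort -rn -k3 | head -n 10

# Count Apache workers; the [h] trick keeps the pattern from
# matching this pipeline's own entry in the process list.
ps aux | awk '/[h]ttpd/ {n++} END {print n+0}'
```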
Next, I checked the MySQL server for any unusual queries or high load:
SHOW FULL PROCESSLIST;
Nothing stood out immediately. The usual suspects weren’t there—no long-running SELECTs that should be indexed better, no runaway transactions. It seemed like a classic case of slow code somewhere in our application stack.
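Polling the processlist works in a pinch, but MySQL of this vintage can log slow queries on its own. A my.cnf fragment along these lines (path and threshold are illustrative; `log_slow_queries` is the pre-5.1 option name):

```ini
[mysqld]
# Log any statement that takes longer than 2 seconds.
log_slow_queries = /var/log/mysql/slow.log
long_query_time  = 2
# Also log queries that scan without using an index.
log-queries-not-using-indexes
```

Restart mysqld after editing and the slow ones show up in the log with timing and row counts attached.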
The Script
Remembering an old script I wrote to monitor and log slow queries, I fired it up:
while true; do
    mysql -u root -p`cat /etc/mysql.secret` -e "SHOW FULL PROCESSLIST" | grep -v '^Id\|Sleep' >> /var/log/slow_queries.log 2>&1
    sleep 5
done
After a few minutes, something caught my eye. There was one slow query that kept popping up:
SELECT * FROM posts WHERE user_id = 345678 ORDER BY timestamp DESC LIMIT 10;
This looked familiar because it’s a common call in our front-end pages for displaying the latest posts from a particular user.
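Before blaming the cache alone, it's worth asking whether the query itself can use an index; if posts only has a primary key, that ORDER BY means a filesort on every cache miss. A hedged sketch (the index name, and the assumption about the existing schema, are mine):

```sql
-- See how MySQL executes the hot query.
EXPLAIN SELECT * FROM posts
WHERE user_id = 345678
ORDER BY timestamp DESC
LIMIT 10;

-- If that shows a full scan plus "Using filesort", a composite index
-- lets MySQL read the rows already in the right order.
ALTER TABLE posts ADD INDEX idx_user_time (user_id, timestamp);
```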
The Fix
I knew where this query was coming from. We have a caching layer in front of these lookups, but it was missing on newer posts: one of the developers had forgotten to refresh the cache when he shipped a new feature, so page views for those users fell straight through to the database. Easy enough to patch the stale row with an UPDATE statement:
UPDATE posts_cache SET ... WHERE user_id = 345678;
Once I ran this, the server’s load dropped significantly. The CPU usage normalized on both the web and MySQL servers.
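Patching the row got us through the evening, but the durable fix lives on the write path: whatever inserts a post has to invalidate the cache entry it just made stale. A toy Python sketch of the pattern (all names are hypothetical, not our actual application code):

```python
# Toy model of tonight's bug: a read-through cache whose write path
# must invalidate it. All names here are illustrative.

posts = []   # stands in for the posts table
cache = {}   # stands in for posts_cache, keyed by user_id

def latest_posts(user_id, limit=10):
    """Read path: serve from cache, rebuild from the 'table' on a miss."""
    if user_id not in cache:
        rows = [p for p in posts if p["user_id"] == user_id]
        cache[user_id] = sorted(rows, key=lambda p: p["ts"], reverse=True)[:limit]
    return cache[user_id]

def create_post(user_id, body, ts):
    """Write path: the step the new feature skipped."""
    posts.append({"user_id": user_id, "body": body, "ts": ts})
    cache.pop(user_id, None)  # forgetting this line is exactly tonight's bug
```

In our case the new feature wrote posts without touching posts_cache at all, so the stale entries had to be patched by hand; keeping the invalidation right next to the insert prevents that whole class of mistake.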
Reflection
As the sun set outside, the server room quieted down, and we finally breathed a sigh of relief. This was just another day in the life for us at Digg, but it reminded me why I love this job—figuring out problems, solving them, and making sure everything keeps running smoothly.
The tech landscape is changing fast, but today’s work was as real and gritty as ever. From open-source stacks to web 2.0, from Xen to Python scripts, the tools may evolve, but the principles remain the same: keep your infrastructure robust, debug like a pro, and stay one step ahead of the next big problem.
There you have it—a day in the life of an ops guy, dealing with the everyday challenges that come with keeping a popular site running smoothly.