$ cat post/the-function-returned-/-i-traced-it-to-the-library-/-the-stack-still-traces.md
the function returned / I traced it to the library / the stack still traces
Debugging Digg’s Scalability Woes
March 27, 2006
It’s been a whirlwind since the new year hit. The tech world is buzzing with all sorts of exciting developments: open-source stacks everywhere, Google hiring like they’re out to dominate the planet, Firefox launching in force, and the rise of Web 2.0. It’s clear that sysadmin roles are evolving—more scripting, more Python/Perl automation. I feel like I’m in the thick of it all.
Today, though, my mind is still on Digg. We’ve got a new feature rolling out—a user-updated leaderboard—and we’re seeing some unexpected load spikes that are pushing our servers to their limits. The site has been growing faster than any of us expected, and now, with this new feature, things have gone from manageable to… well, not so much.
Our stack is mostly LAMP, with a healthy dose of Python for the automation scripts. We’re running the Xen hypervisor on our primary servers, but I’ve got a sinking feeling it’s going to take more than throwing hardware at this problem.
The Setup
We have a cluster of five servers, each running Apache, MySQL, and our custom-built Python app. Our main load balancer routes traffic between these nodes. Everything seems fine on the surface—Apache is configured with KeepAlive, and we’re using Memcached for caching to reduce the load on MySQL.
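For reference, the KeepAlive side of that Apache configuration looks roughly like this; the numbers are illustrative, not our exact production values:

```apache
# httpd.conf excerpt (prefork MPM); values are examples, not prescriptions
KeepAlive On
KeepAliveTimeout 5          # short timeout so idle clients release workers quickly
MaxKeepAliveRequests 100

<IfModule prefork.c>
    StartServers       8
    MinSpareServers    5
    MaxSpareServers   20
    MaxClients       256    # worker cap; raise only if RAM allows
</IfModule>
```

The tension is always between KeepAlive saving connection setup cost and idle keepalive connections pinning workers that MaxClients then can’t hand to new visitors.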
But something’s not right. We’ve got a few hundred thousand users, and as soon as we release the leaderboard feature, the entire site slows down. The logs show Apache starting to thrash: requests are arriving faster than the worker pool can service them. It’s clear our application layer needs some love.
Digging In
First stop: MySQL. I fire up top and see mysqld pegged at 100% CPU. That’s no good. The queries are slow, there are too many of them, and they’re not optimized. Digging through the logs, I realize we’re issuing a flood of SELECT statements, most of which could be cached.
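To see exactly which queries are hurting, the first thing to turn on is MySQL’s slow query log. A sketch of the relevant my.cnf lines; the path and threshold are illustrative, and log_slow_queries is the 4.x/5.0-era spelling of the directive:

```ini
# my.cnf excerpt -- log any statement slower than two seconds
[mysqld]
log_slow_queries = /var/log/mysql/slow.log
long_query_time  = 2
```

From there, running EXPLAIN on the worst offenders shows which of them are doing full table scans for lack of an index, and which are just being run far more often than they need to be.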
I start to implement caching for some of the more frequently accessed data, but even with that, we still can’t handle the load. It’s time to look at the application code. The custom Python app isn’t well-optimized; it has a lot of busy loops and synchronous database calls that block threads unnecessarily. I pull out my editor and start refactoring.
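The caching pattern here is plain cache-aside: check memcached first, fall back to MySQL only on a miss, and write the result back with a short TTL. A minimal sketch, with FakeCache as an in-process stand-in for the memcached client and fetch_from_db as a hypothetical query function:

```python
import time

# Cache-aside sketch. FakeCache mimics the get/set shape of a memcached
# client; fetch_from_db is a hypothetical stand-in for the real MySQL query.

class FakeCache:
    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]        # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.time() + ttl)

cache = FakeCache()
db_hits = 0                             # counts round-trips to "MySQL"

def fetch_from_db(story_id):
    global db_hits
    db_hits += 1
    return {"id": story_id, "diggs": 42}

def get_story(story_id):
    key = "story:%d" % story_id
    story = cache.get(key)
    if story is None:                   # miss: query once, then populate
        story = fetch_from_db(story_id)
        cache.set(key, story, ttl=30)
    return story

for _ in range(3):
    get_story(7)                        # only the first call hits the database
```

With a real client like python-memcached the shape is the same, modulo the TTL argument name; you swap FakeCache for memcache.Client(['127.0.0.1:11211']) and the read path doesn’t change.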
Refactoring and Scaling
I’ve been using Python for years, but sometimes it feels like I’m fighting its limitations. I decide to move some of the more CPU-intensive parts of the application into C or C++. It’s a risky move on our Xen-virtualized Linux boxes, but it’s necessary if we want to scale.
I start by identifying the functions the profiler shows we call most often and spend the most time in. The first target is the function that processes user submissions. I rewrite that hot path in low-level C, and it improves things noticeably. With every optimization I see the load on Apache drop, but we still have a ways to go.
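Profiling is what turns this from guesswork into a target list. A minimal sketch using Python’s standard cProfile module; process_submission is a hypothetical hot path with the kind of quadratic loop that floats to the top of the stats:

```python
import cProfile
import pstats

def process_submission(text):
    # Hypothetical hot path: count duplicate words with an O(n^2) scan,
    # the kind of loop the profiler flags immediately.
    words = text.split()
    dupes = 0
    for i, w in enumerate(words):
        if w in words[:i]:
            dupes += 1
    return dupes

profiler = cProfile.Profile()
profiler.enable()
result = process_submission("digg digg story story story user " * 200)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative")
# stats.print_stats(5)  # dump the top five functions by cumulative time
```

Cumulative time is usually the right sort key for deciding which caller to attack first; once the worst offender is obvious, that’s the candidate for a C rewrite.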
The Load Balancer
After optimizing our application and caching data where possible, we’re still hitting the limits of our hardware. It’s time for a change in strategy. We decide to split the load between multiple clusters. This means we need to upgrade our load balancer to handle more nodes efficiently.
I spend hours tweaking Nginx configuration files, trying different settings to see how they affect performance. Eventually, I hit upon an Nginx setup that works well—using keepalive with a reduced timeout and limiting the number of simultaneous connections per client.
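The shape of the config I landed on, sketched with made-up upstream addresses; the per-client limit here uses the limit_conn_zone spelling from current nginx documentation (older releases spelled it limit_zone), so treat this as the idea rather than a paste-ready file:

```nginx
http {
    # one shared zone tracking connections per client IP (zone name is our choice)
    limit_conn_zone $binary_remote_addr zone=perip:10m;

    upstream digg_app {
        # hypothetical backend addresses
        server 10.0.1.11;
        server 10.0.1.12;
        server 10.0.1.13;
    }

    server {
        listen 80;
        keepalive_timeout 15;   # down from the 75-second default
        limit_conn perip 20;    # cap simultaneous connections per client

        location / {
            proxy_pass http://digg_app;
        }
    }
}
```

The shorter keepalive timeout frees connections from idle browsers quickly, and the per-client cap keeps any single aggressive client from starving everyone else.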
The Aftermath
After all this work, Digg is running smoother than it has in months. We’re getting fewer complaints about slow load times, and our servers are breathing easier. It’s a good feeling to see that hard work paying off.
Looking back, I realize that this project taught me a lot about performance tuning and the importance of scaling your applications properly. The tech landscape is always changing, but the fundamentals—like understanding your stack, profiling bottlenecks, and optimizing code—remain constant.
Conclusion
As I sit here coding away, I’m reminded why I love working in ops and infrastructure. There’s a thrill in solving these kinds of problems that keeps me coming back day after day. The tech world is moving fast, but it’s the small victories like this one that keep us all going.
Until next time,
Brandon