$ cat post/the-monolith-ran-/-we-documented-nothing-then-/-the-stack-still-traces.md

the monolith ran / we documented nothing then / the stack still traces


Title: Debugging a Server Farm in the Great Recession


March 31, 2008 was just another Monday when I rolled into work, but there’s something about this time that feels like it should be recorded. The financial crisis was starting to simmer under the surface, and tech hiring had slowed down. And on that particular Monday, my server farm was going haywire.

The problem started last night. At around 3 AM, our production servers began acting up. CPU usage spiked, memory usage ballooned, and suddenly, half of our services were refusing requests. I checked the logs, but nothing jumped out at me. This was a classic case where the only thing you could do was dive in headfirst.

I pulled up the server monitoring dashboard on my screen. The graphs showed peaks and valleys, suggesting some kind of periodic behavior. But what? I grabbed a spare laptop and fired up SSH to one of our servers. `top` revealed a process with unusually high CPU usage: our logging service, which had grown out of control.
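From a shell, the same picture can be pulled up without the interactive `top` display. A rough equivalent of what I was looking at (GNU `ps` flags assumed):

```shell
# List the five hungriest processes, sorted by CPU usage, highest first.
# The runaway logging service sat at the top of a listing like this.
ps aux --sort=-%cpu | head -n 5
```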

I tried killing the process but nothing happened—obviously, there were more instances running. My first instinct was to investigate further by adding some logging around the code that launches these processes. I scribbled down the steps in my notebook:

  1. Identify the parent process: `ps -ef | grep <logging-service>`.
  2. Check for any zombie processes: `ps auxwww | grep defunct` (didn’t find any).
  3. Inspect the logs more closely: `/var/log/app.log`.
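Strung together, the notebook steps looked roughly like this at the prompt (the service name is a stand-in for the real process name):

```shell
# 1. Identify parent/child instances of the logging service.
#    The [l] in the pattern keeps grep from matching its own command line.
ps -ef | grep '[l]ogging-service' || echo "no instances found"

# 2. Count zombie (defunct) processes; we found none.
ps auxwww | grep -c '[d]efunct' || true

# 3. Skim the most recent application log entries, if the log exists.
[ -f /var/log/app.log ] && tail -n 100 /var/log/app.log || true
```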

I spent an hour or so digging through logs and trying different commands, but nothing was sticking out as obviously wrong. My colleague mentioned that our load balancer might be sending too many requests to the logging service, which could explain the sudden spike in usage.

We decided to use `ab` (Apache Bench) to simulate a large number of concurrent connections and see if it reproduced the issue. Aiming for 50,000 total requests with 10,000 of them in flight at once, I ran:

```shell
ab -c 10000 -n 50000 -s 60 <logging-service-url>
```

The output was enlightening—our service couldn’t handle that many simultaneous connections. The CPU and memory usage spiked to levels we hadn’t seen before.
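While `ab` hammered the service, we watched the worker processes with a crude polling loop along these lines (the PID here is a stand-in; in practice we grabbed the real one from the earlier `ps` output):

```shell
# Poll CPU and memory usage of one worker once a second.
# $$ (this shell's own PID) stands in for the logging-service PID.
PID=$$
for i in 1 2 3; do
  ps -o pid,pcpu,pmem,comm -p "$PID"
  sleep 1
done
```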

Back in the monitoring dashboard, I saw our servers starting to throttle incoming requests due to the load. This explained why some services were failing—too much traffic for them to process without crashing.

With this new insight, I knew what needed to be done. We would need to scale the logging service horizontally and maybe even look into optimizing its code to handle more concurrent connections. In the meantime, we could add a throttling mechanism in our load balancer to prevent such high spikes from happening again.
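In today’s terms, the throttling rule we had in mind looks something like nginx’s `limit_req`. This is a sketch, not what we actually ran: the location path, zone name, rate numbers, and backend name are all assumptions.

```nginx
http {
    # Cap each client at 100 requests/second toward the logging service,
    # absorbing short bursts instead of passing the spike straight through.
    limit_req_zone $binary_remote_addr zone=logsvc:10m rate=100r/s;

    server {
        location /log {
            limit_req zone=logsvc burst=200;
            proxy_pass http://logging_service;
        }
    }
}
```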

I drafted an urgent email to the team:


Subject: Urgent - Logging Service Performance Issue

Team,

We’re experiencing some performance issues with our logging service today. Our servers are under heavy load and services are starting to fail. I’ve identified that we need to add more instances of the logging service and possibly optimize its code for better concurrency.

Please join me in the monitoring dashboard right after you log in and let’s walk through the steps together:

  1. Add more instances of the logging service.
  2. Review and potentially optimize the code.
  3. Implement a throttling mechanism on our load balancer to prevent future spikes.

Let’s get this resolved as quickly as possible!

Thanks, Brandon


After sending out the email, I sat back and watched the dashboard. The new logs started coming in steadily, without the spike we had seen earlier. It felt good to have identified the problem and worked towards a solution.

This episode reminded me of how much debugging can feel like a game of whack-a-mole—sometimes you find something obvious, other times it’s a mystery that takes days to unravel. But no matter what, staying calm and methodical always helps in the long run.

As I left for the day, I couldn’t help but think about all the changes happening around us. GitHub was just launching, cloud computing was going mainstream, and tech hiring was slowing down with the economy. We were part of a larger industry that was evolving rapidly, but today’s problem felt like a solid reminder of why we do this work in the first place: solving real problems is what makes it all worthwhile.


That’s how I wrapped up my day on March 31, 2008, dealing with an unexpected server issue and reflecting on the broader tech landscape.