
# Debugging a Mysterious Hang in Our Production Service


August 13, 2007, was just another day for the team at my startup. We were running what even then felt like an old-school stack: Python behind Apache, with some homegrown Ruby scripts sprinkled in. GitHub hadn’t launched yet (shocking, I know), but we were already feeling the pull towards distributed version control and automated testing.

It was mid-afternoon when the ops team reported a weird hang in one of our production services. The service in question handled payments for our online marketplace. Our users couldn’t place orders or pay for anything; everything seemed to be frozen, and we had no idea why.

I grabbed my laptop from under the desk and headed over to the colocation facility where our servers were still housed. My colleague Sam was already there, peering at the logs on one of our production machines. He was tailing the access log with something like:

```shell
tail -f /var/log/apache2/access.log
```

He was scanning for unusual patterns or repeated messages that might hint at what was going wrong. We both knew we had a serious issue on our hands, and our users were likely pissed off.
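
One quick way to spot that kind of pattern in an access log is to count which request paths are being hit most. A sketch (the `top_paths` name is mine, and it assumes the default combined/common log format, where the request path is the seventh whitespace-separated field):

```shell
# Count the ten most frequently requested paths in an Apache access log.
# Assumes the default combined/common log format, e.g.:
#   127.0.0.1 - - [13/Aug/2007:14:02:01 -0700] "GET /pay HTTP/1.1" 200 512
# where field 7 is the request path.
# usage: top_paths /var/log/apache2/access.log
top_paths() {
  awk '{print $7}' "$1" | sort | uniq -c | sort -rn | head -10
}
```

A sudden spike in one path, relative to its usual share of traffic, is often the first hint of where a hang lives.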

After several minutes of staring at the log, Sam let out an exasperated sigh: “It’s just requests stacking up. Maybe it’s a network issue?” He tried running netstat -tunp, looking for connections stuck in odd states.

I decided to take over and ran:

```shell
ps aux | grep python
```

to see if the Python processes were still alive. They were, but there was no obvious sign of what they were doing. I tried sending SIGINT with kill -2 PID, but nothing changed; the process just kept hanging.
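
In hindsight, one cheap check at that point is the kernel's view of the process: is it runnable, sleeping, or stuck in uninterruptible sleep (state D, which would also explain a process shrugging off SIGINT)? A sketch, with `proc_state` being my name for it:

```shell
# Show a process's state and, where the kernel reports it, the function
# it is blocked in. A STAT of "D" means uninterruptible sleep (usually
# disk or NFS I/O); such a process ignores signals until it wakes up.
# usage: proc_state PID
proc_state() {
  ps -o pid=,stat=,wchan= -p "$1"
}
```

A forest of workers all showing the same blocked state is a strong hint they are waiting on one shared resource.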

I knew we needed to dig deeper. We both agreed that this wasn’t a simple network issue or an Apache problem; it seemed more like our application itself was the bottleneck. Sam suggested using strace on one of the hung processes:

```shell
sudo strace -p PID -s 1024
```

This gave us a detailed view of what the process was trying to do at the system-call level, but still no clear answer.

Just as we were debating our next move, the phone rang. It was our support team calling with reports from users complaining that they couldn’t place orders or pay for anything. The tension in the room grew thicker.

We decided to try a more drastic approach: restarting the service. I put together a quick one-liner to stop the hung process gracefully, falling back to a hard kill, and sent it over to Sam, who ran:

```shell
kill -15 PID && sleep 30 && kill -9 PID
```

We watched the screen closely as it ran. After a tense few minutes, the pile of stacked-up requests started coming back down. Success!
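
One wart in that one-liner: it sends SIGKILL unconditionally after the 30-second sleep, even if the process exited cleanly. A slightly safer sketch (the `graceful_stop` name and timings are mine, not what we actually ran):

```shell
# Send SIGTERM, poll for up to TIMEOUT seconds, and only escalate to
# SIGKILL if the process is still around. "kill -0" sends no signal;
# it just checks that the PID still exists.
# usage: graceful_stop PID [TIMEOUT_SECONDS]
graceful_stop() {
  pid=$1
  timeout=${2:-30}
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  i=0
  while [ "$i" -lt "$timeout" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited on its own
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid" 2>/dev/null               # last resort
}
```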

However, we knew this wasn’t a long-term solution. We needed to understand what caused the hang and fix it properly. Sam and I spent the next few hours analyzing the codebase, running tests, and trying to simulate the conditions under which the issue occurred.

By the end of that day, we managed to pin down the problem: a race condition in our payment-processing logic, where two simultaneous requests mutated the same shared state without any synchronization. After adding locking around the critical section, we deployed the fix and our service was back to normal.
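
The pattern behind the fix is easy to show in miniature (a sketch with hypothetical names, not our actual payment code). Without the lock, two concurrent requests can both read the same balance and one update is silently lost:

```python
import threading

# Shared state touched by concurrent request handlers (hypothetical).
balances = {"alice": 100}
_balance_lock = threading.Lock()

def charge(user, amount):
    # The lock makes the read-check-write sequence atomic. Without it,
    # two threads can both read the old balance, and one decrement is
    # silently lost (a classic lost update).
    with _balance_lock:
        current = balances[user]
        if current < amount:
            return False
        balances[user] = current - amount
        return True
```

The with-statement guarantees the lock is released even if the body raises, which matters a lot in a payment path.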

Looking back on it now, this event felt like an old-school debug session from the pre-cloud era—no fancy tools or AWS auto-scaling groups. But it taught us a valuable lesson: no matter how advanced your technology stack is, you still need to understand how things work under the hood and be ready for the unexpected.

That night, I couldn’t shake off the feeling that more modern infrastructure, say AWS EC2 with proper monitoring around it, might have made debugging a bit easier. Nonetheless, it was a good reminder that sometimes the simple tools can still get the job done when used correctly.