
Debugging the Great MySQL Meltdown


February 27, 2006. Another day, another bug to chase down in our beloved LAMP stack. Back then, I was a young sysadmin at a small startup whose user base was starting to grow exponentially. We were riding the wave of Web 2.0 excitement, with everyone clamoring for new features every other week.

Our primary database backend was MySQL, running on an aging server under the Xen hypervisor. Everything was going well until one Friday afternoon when our entire service went down. It felt like a major panic button had been pushed. Users were hitting the dreaded 503 Service Unavailable error, and we needed to figure out what hit us.

The Initial Smoke Signals

As soon as I got the alert, I rushed over to my desk. The first thing I did was check our monitoring system. It showed that both CPU usage and memory were within normal limits, but there was an anomaly with disk activity—skyrocketing read/write operations. That was a clear sign of a database problem.

I quickly logged into the server via SSH and started tailing the MySQL error logs. The first thing I saw made my heart sink: “Can’t connect to local MySQL server through socket.” This meant that the MySQL daemon had crashed, and our application layer was choking on this failure.

Digging Deeper

To understand why MySQL went down, I needed more data. By the time I ran top, mysqld_safe had already respawned the daemon, and mysqld was eating up about 80% of the CPU. That's when I decided to dig into the process with strace. strace is one of those tools you can never live without once you know how to use it.

Running strace -p $(pgrep mysqld) gave me a glimpse into what the daemon was actually doing. The output was dense, but I could see repeated locking and unlocking calls on the table files, which hinted at some kind of deadlock or lock-contention issue.

The Concurrency Mystery

To figure out the exact cause, I needed more context. I decided to enable the general query log in the my.cnf file (on the MySQL versions we ran back then it was the log directive; later releases call it general_log). This would give me a chronological record of every SQL statement executed on the server. After restarting mysqld with these changes, I waited for the next crash.
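For readers who never ran MySQL from that era: the setting lived in the [mysqld] section of my.cnf. A minimal sketch of what we used, with illustrative paths — and note that logging every statement adds noticeable I/O on a busy server, so you don't want to leave it on forever:

```ini
[mysqld]
# MySQL 4.x / 5.0: write every statement to the general query log
log = /var/log/mysql/mysql-general.log

# MySQL 5.1 and later spell it like this instead:
# general_log      = 1
# general_log_file = /var/log/mysql/mysql-general.log
```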

A few hours later, we had another outage. This time around, reviewing the logs was like reading a detective novel. The last queries before the crash were all SELECT statements on the same table. It seemed that a single transaction had held a lock on this table for too long, blocking everything that queued up behind it.

A Race Against Time

Now came the tricky part: identifying which connection was causing the issue without breaking anything in production. I knew that simply restarting mysqld might not solve the problem; the same deadlock could just reappear on a different query. Instead, I decided to find the offending threads and kill them with mysqladmin kill.
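For context, mysqladmin kill terminates a server thread by its id, and the ids come from mysqladmin processlist (or SHOW PROCESSLIST). A minimal sketch of picking out long-locked threads from that output — the column layout matches the era's table format, but the sample rows and the 60-second threshold are made up for illustration:

```python
# Sketch: find thread ids stuck in the "Locked" state in
# `mysqladmin processlist`-style output. Sample data is invented.
SAMPLE = """\
+----+------+-----------+------+---------+------+--------+------------------------------+
| Id | User | Host      | db   | Command | Time | State  | Info                         |
+----+------+-----------+------+---------+------+--------+------------------------------+
| 12 | app  | localhost | prod | Query   |  842 | Locked | SELECT * FROM orders WHERE.. |
| 13 | app  | localhost | prod | Query   |  840 | Locked | SELECT * FROM orders WHERE.. |
| 14 | app  | localhost | prod | Sleep   |    3 |        |                              |
+----+------+-----------+------+---------+------+--------+------------------------------+
"""

def locked_threads(processlist_text, min_time=60):
    """Return ids of threads that have sat in the Locked state too long."""
    ids = []
    for line in processlist_text.splitlines():
        cols = [c.strip() for c in line.strip("|").split("|")]
        if len(cols) < 8 or not cols[0].isdigit():
            continue  # skip the ASCII borders and the header row
        thread_id, time_s, state = cols[0], cols[5], cols[6]
        if state == "Locked" and int(time_s) >= min_time:
            ids.append(thread_id)
    return ids

if __name__ == "__main__":
    for tid in locked_threads(SAMPLE):
        # In anger this would be: subprocess.run(["mysqladmin", "kill", tid])
        print(f"mysqladmin kill {tid}")
```

Printing the commands first instead of running them is a cheap dry-run safeguard when you're about to kill threads on a production database.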

I wrote a quick Python script to parse the logs for the most recent transactions on the table and then used mysqladmin to terminate them. It was a gamble, but we had no other choice. After running the script, the server came back up almost immediately.
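The original script is long gone, but the core of it was a single pass over the general query log, collecting the connection ids of statements touching the hot table, then shelling out to mysqladmin kill. A rough reconstruction — the log-line format matches the old general log, while the table name, log path, and helper names are assumptions for illustration:

```python
import re
import subprocess

# Hypothetical hot table; in the real incident this came from the log review.
HOT_TABLE = "orders"

# General-log entries look roughly like:
#   "060227 14:03:12    42 Query   SELECT ... FROM orders WHERE ..."
# The leading timestamp is omitted when consecutive entries share a second.
ENTRY = re.compile(r"^\s*(?:\d{6}\s+\d{1,2}:\d{2}:\d{2}\s+)?(\d+)\s+Query\s+(.*)$")

def threads_touching(log_text, table):
    """Collect connection ids whose queries mention the given table."""
    ids = set()
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if m and re.search(rf"\b{table}\b", m.group(2), re.IGNORECASE):
            ids.add(m.group(1))
    return sorted(ids, key=int)

def kill_threads(ids):
    """Terminate each connection with mysqladmin (destructive -- use with care)."""
    for tid in ids:
        subprocess.run(["mysqladmin", "kill", tid], check=False)

if __name__ == "__main__":
    sample = (
        "060227 14:03:12\t  42 Query\tSELECT * FROM orders WHERE id = 7\n"
        "\t\t  43 Query\tUPDATE users SET last_seen = NOW()\n"
        "060227 14:03:13\t  44 Query\tSELECT COUNT(*) FROM orders\n"
    )
    # kill_threads(...) would act on these ids against a live server.
    print(threads_touching(sample, HOT_TABLE))  # -> ['42', '44']
```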

Lessons Learned

The crash taught us several important lessons:

  1. Logging is Crucial: General query logging gave us invaluable insights into what MySQL was doing.
  2. Concurrency Management: We needed to review our database schema and application logic for potential deadlocks.
  3. Scripting Saves the Day: Writing quick scripts to automate debugging processes can be a lifesaver.

As I sat back in my chair, the relief of having resolved the issue washed over me. This little adventure taught me that as much as we rely on tools like MySQL, it’s our responsibility to understand how they work and what can go wrong. The next time something goes sideways, I’ll be better prepared to handle it.


That was a day in 2006, and it’s one of many stories from those formative years that shaped my approach to sysadmin work. Debugging the great MySQL meltdown not only helped us get back online but also solidified some key practices we still use today.