$ cat post/the-old-server-hums-/-we-ran-it-on-bare-metal-once-/-i-miss-that-old-term.md
the old server hums / we ran it on bare metal once / I miss that old term
A Day in Debugging: Docker Containers Gone Wild
July 22, 2013
Today was one of those days that could have been a poster child for “Do Things That Don’t Scale.” I woke up with my usual pre-coffee grogginess, but as the caffeine kicked in, I realized we had a serious problem on our hands. Our team was using Docker containers to run our application infrastructure, and the containers were misbehaving badly.
The Setup
We were running multiple services across several hosts, each service in its own container. This was all part of our move towards more flexible deployment strategies. We’d been excited about Docker because it promised better isolation between services and easier rollouts. But now, a simple restart on one host had cascaded into a full-blown network storm.
The Storm
We were using etcd to handle service discovery for our containers. Every container was supposed to register itself with an etcd instance running somewhere in the cluster. On one of our hosts, a container crashed and got stuck in a restart loop, re-registering with etcd on every cycle. The flood of requests overloaded etcd, which became unresponsive, which in turn caused more services to time out and retry.
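For context, each container ran a small registration loop, conceptually like the sketch below. The key path, address, port, and TTL are illustrative rather than our actual values, and I’m assuming etcdctl’s old set --ttl syntax.

```bash
#!/bin/bash
# register.sh - sketch of a per-container registration loop.
# Key path, address, and TTL are illustrative values.
SERVICE_NAME="web"
ADDR="10.0.1.5:8080"            # where other services should reach us
ETCD="http://127.0.0.1:4001"    # etcd's old default client port

while true; do
  # Write our address with a TTL so a dead container falls out of
  # service discovery on its own instead of lingering forever.
  etcdctl --peers "$ETCD" set "/services/${SERVICE_NAME}/${HOSTNAME}" "$ADDR" --ttl 60
  sleep 45    # refresh comfortably before the 60s TTL expires
done
```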
The logs were clear enough: 100% CPU on the etcd node. What they didn’t show was the sheer volume of requests coming in: thousands per second, from containers that should have registered once and then gone quiet. It was a classic case of forgetting the basics when you’re moving too fast.
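For what it’s worth, the volume only became obvious once we counted inbound client connections on the etcd box itself, with something along these lines (4001 was etcd’s default client port back then):

```bash
# Count established connections to etcd's client port, grouped by
# client IP; the noisy hosts float straight to the top.
netstat -tn | awk '$4 ~ /:4001$/ {split($5, a, ":"); print a[1]}' \
  | sort | uniq -c | sort -rn | head
```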
The Debug
I spent most of the day digging into the issue. I started by adding more logging around our container startup scripts and their communication with etcd. We were using fleet to manage the containers, so I had to look at both the client- and server-side configuration.
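The extra logging was nothing fancy: a timestamped helper dropped into the startup scripts, roughly like this (same illustrative names as the registration sketch above, and a made-up log path):

```bash
# Tiny logging helper for the startup scripts.
log() {
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $*" | tee -a /var/log/container-start.log
}

log "starting ${SERVICE_NAME}, registering with etcd at ${ETCD}"
```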
After a few iterations, I spotted a timing bug in how we registered services: our containers weren’t waiting at all before re-registering after a restart, so every crash-and-restart cycle fired off a fresh burst of registrations and hammered etcd further. Once I fixed this with a simple delay script, things started to settle down.
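The “simple delay script” amounted to a randomized pause in front of registration, so a host full of restarting containers wouldn’t hit etcd in lockstep. A minimal sketch, with made-up numbers:

```bash
#!/bin/bash
# delay-register.sh - wait a random 0-29s before registering so
# simultaneous restarts don't stampede etcd. Numbers are made up.
SPLAY=$(( RANDOM % 30 ))
echo "delaying registration by ${SPLAY}s"
sleep "$SPLAY"
exec /usr/local/bin/register.sh    # the registration loop sketched earlier
```

Crude, but it spread the herd out enough for etcd to catch its breath.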
But the real lesson here is about understanding the implications of your tech choices. Docker and etcd promised great flexibility, but we had to understand their limitations and how they interacted in our environment. We changed how containers register: staggering the timing and limiting the number of simultaneous requests.
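The request-limiting half was classic exponential backoff on failed registrations, again sketched with illustrative values:

```bash
# Back off exponentially on failed registrations instead of
# retrying in a tight loop; cap the delay at 60 seconds.
delay=1
until etcdctl --peers "$ETCD" set "/services/${SERVICE_NAME}/${HOSTNAME}" "$ADDR" --ttl 60; do
  echo "etcd registration failed; retrying in ${delay}s" >&2
  sleep "$delay"
  delay=$(( delay < 60 ? delay * 2 : 60 ))
done
```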
The Aftermath
By the end of the day, I felt like I’d just run a marathon. But it was satisfying to solve a problem that had been causing issues for weeks. It’s moments like these that remind me why I love debugging—the thrill of finding that one piece of code or configuration setting that fixes everything.
This experience also solidified my belief in thorough testing and in understanding the full lifecycle of your tech stack. The tech world moves fast, but it’s crucial to step back and ask what you actually need before diving in with a tool like Docker.
As I closed out the day, I couldn’t help but think about all the other “things that don’t scale” we might encounter if we keep pushing the boundaries of our infrastructure. But for now, at least one storm was over, thanks to a good dose of caffeine and some well-placed debugging scripts.