$ cat post/a-race-condition-/-the-load-average-climbed-alone-/-the-pod-restarted.md

a race condition / the load average climbed alone / the pod restarted


July 7, 2014 was a typical workday at my company. We were in the midst of our microservices journey, and Docker containers had just started to make waves within the tech community. I remember sitting down to debug a particularly pesky issue that was plaguing one of our services running inside Docker.

The Setup

We were using Docker for containerization, which was all the rage back then. Our application stack consisted of multiple microservices, each serving its own purpose and communicating through various protocols. We had a few services running on CoreOS, managed with fleet, and everything seemed to be humming along smoothly… until it didn’t.

The Problem

One morning, I got a call from our operations team. They reported that one of the services was failing to start up properly. After some quick triage, we discovered that it would occasionally fail during its startup sequence when running in a Docker container, but worked fine when run locally on a developer machine.

The Hunt

I decided to dive into the code and logs. The error messages were cryptic, pointing towards some kind of dependency issue. I spent hours stepping through the initialization logic with a debugger, trying to pinpoint exactly where things went south. It was frustrating: the service behaved differently on different machines, which made the problem hard to reproduce consistently.

Docker vs CoreOS

I recalled that we were using Docker images built for our development environment and then deployed on CoreOS nodes managed by fleet. One of my first hypotheses was that there might be a difference in the environment variables or file system permissions between these two environments. I spent some time comparing the Dockerfile and the CoreOS setup, but still couldn’t nail down the root cause.
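The environment-variable comparison itself is mechanical. Here's a minimal sketch of the kind of diff I was doing by hand at the time, with made-up variable names standing in for our real configuration:

```python
def diff_env(dev_env, prod_env):
    """Return keys whose values differ, or that exist in only one environment."""
    keys = set(dev_env) | set(prod_env)
    return {k: (dev_env.get(k), prod_env.get(k))
            for k in keys
            if dev_env.get(k) != prod_env.get(k)}

# Hypothetical dumps: in practice these came from `env` inside the container
# on a dev machine versus a CoreOS node.
dev = {"SERVICE_HOST": "localhost", "CONFIG_URL": "http://config.internal"}
coreos = {"SERVICE_HOST": "10.0.0.12"}

for key, (dev_val, prod_val) in sorted(diff_env(dev, coreos).items()):
    print(f"{key}: dev={dev_val!r} coreos={prod_val!r}")
```

In this case the diff turned up nothing conclusive, which is exactly why the hypothesis was eventually abandoned.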

The Breakthrough

After several dead ends, it hit me: could it be related to DNS resolution? We were relying heavily on internal services for configuration data, and maybe there was a networking issue between our containers. I decided to run an experiment: set up a simple echo server inside Docker and connect to it from another container, to see whether the problem persisted.

To my surprise, the connection worked flawlessly. That narrowed things down: container-to-container networking on a single host was fine, so the problem had to lie in the networking of the CoreOS cluster itself, or in the way we were handling DNS resolution for internal services.
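Stripped of Docker, the experiment is just an echo round-trip. A self-contained sketch over localhost (in the real test, the server ran in one container and the client in another; port and payload here are arbitrary):

```python
import socket
import threading

def run_echo_server(sock):
    """Accept one connection and echo whatever it sends back."""
    conn, _ = sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)

# Bind to an ephemeral port on localhost; inside Docker this would be the
# container's address on the bridge network instead.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

threading.Thread(target=run_echo_server, args=(server,), daemon=True).start()

# The "other container": connect and check the round trip.
with socket.create_connection((host, port), timeout=5) as client:
    client.sendall(b"ping")
    reply = client.recv(1024)

print(reply)  # b'ping' means basic connectivity works
```

If the reply comes back intact, raw connectivity is fine, and suspicion shifts up the stack to name resolution or cross-host routing.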

The Solution

I reached out to our network ops team, who confirmed that they had recently updated their firewall rules. One of those changes had inadvertently blocked communication between containers in different namespaces. After the firewall settings were adjusted, everything started working as expected.
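Verifying a fix like this is worth automating. A tiny reachability probe (a hypothetical helper, not our actual tooling) can be run from each node against a peer's service port before and after a firewall change:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

Running the same probe from every node makes it obvious whether a rule change broke connectivity in one direction, one namespace, or everywhere.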

Reflections

This experience taught me a valuable lesson about container networking and the importance of thoroughly testing your application stack in every environment it actually runs in. It was also a reminder that the simplest explanations (here, a firewall rule) can be the hardest to find when the stack itself, Docker on CoreOS with fleet, looks complex.

Back in 2014, as Docker was just starting its mainstream adoption journey, these kinds of issues were common but still frustrating. I’m glad we were able to resolve it quickly and learn from it.


That’s my take on a typical day debugging containers back when they were the talk of the town. It’s always interesting looking back at how far we’ve come in just a few years!