$ cat post/packet-loss-at-dawn-/-a-midnight-pager-i-still-hear-/-the-build-artifact.md

packet loss at dawn / a midnight pager I still hear / the build artifact


Debugging Docker in Production: A Day in Hell

August 15, 2016. This is the day I spent staring at a wall with a team of developers trying to figure out why our new microservices architecture was spitting out errors left and right.

It all started last month when we decided it was time to jump on the Docker bandwagon. Our old VMs were getting tired, and Kubernetes seemed like the future. We migrated everything over, and things looked rosy until the inevitable happened: a production outage.

The Setup

We had been using Terraform 0.x for our infrastructure as code, which at the time felt pretty solid. But now every microservice we deployed was failing with cryptic errors. Prometheus wasn't raising any alerts, and the Grafana dashboards for the new services were just flat lines. We knew something was wrong, but what?

The Investigation

I started by checking our Kubernetes clusters. They seemed fine: no resource exhaustion, no pod evictions. But then I noticed that every single container was failing with the same error:

docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec format error".

This error sounded like something to do with the Docker image, but how could a perfectly fine image suddenly start failing?
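In hindsight, an "exec format error" means the kernel refused to execute the container's entrypoint at all, so the useful first question is which image reference each broken pod actually resolved to. A rough sketch of the kind of checks we were running (the pod and namespace names here are placeholders, not our real services):

# See which pods are failing and what state they are in
kubectl get pods --all-namespaces

# Print the exact image reference a failing pod resolved to
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Look at the events and container statuses around the start failure
kubectl describe pod <pod-name> -n <namespace>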

The Debugging Journey

I spent hours pulling down our images and running them locally, and the local containers ran just fine. That meant the problem had to be in how Kubernetes or Terraform was referencing the images, so I decided to dig deeper into the Terraform configuration for these services.
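For what it's worth, the local check was nothing fancy; roughly the following, where registry.example.com/payments-api and the v1.4.2 tag are stand-ins for our real registry path and version:

# Pull the exact tag the cluster was trying to run and start it locally
docker pull registry.example.com/payments-api:v1.4.2
docker run --rm registry.example.com/payments-api:v1.4.2

# Sanity-check what the image actually declares
docker inspect --format '{{.Architecture}} {{.Config.Entrypoint}}' registry.example.com/payments-api:v1.4.2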

After a few iterations, we discovered that one of our modules was building the image tag as v${version}, where ${version} was supposed to be set by our CI/CD pipeline. Sometimes, though, the version wasn't being resolved at all, and we ended up pointing Kubernetes at an image with an empty or invalid tag.
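To make the failure mode concrete, here is a simplified sketch of the kind of deploy step that bit us; the variable and image names are illustrative, not our actual pipeline:

# VERSION is supposed to be injected by the CI job, e.g. VERSION=1.4.2
IMAGE_TAG="v${VERSION}"

# If VERSION is unset or empty, the tag silently collapses to just "v",
# and the deploy hands Kubernetes a reference that doesn't point at a real image
echo "deploying registry.example.com/payments-api:${IMAGE_TAG}"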

The Fix

We fixed it by ensuring that the version placeholder was always correctly replaced before passing the tag to Kubernetes. This change took a few iterations, but eventually, everything started working as expected.
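The guard itself doesn't have to be clever. A minimal sketch of the idea, assuming a bash deploy script and the same illustrative names as above:

#!/usr/bin/env bash
set -euo pipefail

# Refuse to deploy unless the CI pipeline handed us a usable semver version
if [[ -z "${VERSION:-}" || ! "${VERSION:-}" =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
  echo "ERROR: VERSION is empty or not valid ('${VERSION:-}'); aborting deploy" >&2
  exit 1
fi

IMAGE_TAG="v${VERSION}"
echo "deploying registry.example.com/payments-api:${IMAGE_TAG}"

With a check like that in place, an unresolved tag fails the pipeline instead of reaching the cluster.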

Reflections on the Era

Back then, container orchestration and infrastructure as code were still relatively new. Helm was in its infancy and Istio didn't exist yet, so we relied heavily on our own custom scripts and workflows to manage deployments. GitOps wasn't a term anyone was using yet, and Terraform 0.x was the version we had to work with.

The Dropbox hack and the PowerShell open-sourcing were interesting stories, but they weren’t directly relevant to what we were dealing with that day. The real challenge for us was getting our new tech stack to play nicely together in a production environment.

Conclusion

This experience taught me the importance of thorough testing in a staging environment before pushing changes to production. It also highlighted the need for robust error handling and fallback mechanisms in our CI/CD pipelines. Looking back, I’m glad we were able to identify and resolve the issue, but it was a stark reminder that even with all the fancy tools out there, basic principles like ensuring dependencies are correctly managed still matter.

Debugging Docker containers is not glamorous work, and doing it in production makes it downright miserable. It's just one of those days where everything seems to go wrong, and you're left wondering if your career has become a series of troubleshooting sessions. But that's what we do: find the bugs, fix them, and move on to the next challenge.


This was my day in hell, but it sure taught me valuable lessons for future projects.